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I.  Summary 


A.  Task  Objectives 

Below  we  list  the  original  statement  of  work. 

1 .  Develop  new  statistical  methodologies,  parametric  and  nonpara- 
metric,  which  are  particularly  applicable  to  the  problems  of 
discriminant  analysis,  outlier  detection,  script  matching  and  wave 
form  matching  in  the  context  of  monitoring  nuclear  proliferation. 

2.  Determine  better  methods  for  estimating  statistical  distributions 
which  may  be  used  for  both  discrimination  purposes  and  for 
assessing  system  performance. 

3 .  Develop  a  framework  in  which  the  results  of  monitoring  and  the 
capability  of  the  monitoring  network  can  be  usefully  and  correctly 
stated. 

4.  Apply  the  above  developed  methodology  to  data  at  the  ARP  A 
Center  for  Seismic  Studies  to  assess  the  effectiveness  of  the  above 
theoretical  developments. 

Item  1  was  accomplished  by  developing  several  near  optimal  tests  to  determine 
when  observations  should  be  regarded  as  "unusual."  A  paper  developing  a 
nonparametric  methodology  for  discriminating  between  two  groups  has  been 
accepted  for  publication  in  the  journal  Computational  Statistics  and  Data  Analysis. 
An  additional  paper  extending  our  outlier  detection  methodology  to  the  important 
missing  data  scenario  was  distributed  as  a  technical  report.  Computer  code  to 
implement  these  results  can  be  obtained  upon  request  by  contacting  Dr.  H.L. 

Gray,  Department  of  Statistical  Science,  Southern  Methodist  University. 

In  addition,  the  results  in  outlier  detection  were  extended  to  two  outliers  ftom  a 
mixture.  For  example,  this  latter  test  would  allow  one  to  test  for  an  outlier  from  a 
training  set  made  up  of  mining  explosions  and  earthquakes,  rather  than  just  one  or 
the  other.  Although  this  latter  work  is  not  100%  complete,  it  is  to  the  state  of 
completion  that  it  can  be  used  in  most  settings.  The  code  for  this  program  has  also 
been  passed  on  to  MRC  and  is  available  upon  request  by  contacting  Dr.  Gray.  A 
paper  on  this  new  outlier  detection  is  being  prepared  for  submission  for 
publication. 

Other  methodologies  were  also  developed.  Although  the  theory  is  basically  now 
developed,  these  latter  methodologies  are  not  ready  for  distribution. 


V 


Regarding  items  2  and  3,  the  bootstrap  methodology  was  introduced  to  effectively 
solve  both  of  those  problems.  To  satisfy  item  4,  the  outlier  method  developed 
under  this  contract  was  applied  to  nuclear  explosions,  mining  blasts,  and 
earthquakes  in  diverse  geological  regions  recorded  by  the  ARCESS  and  GERES  S 
arrays,  CDSN  station  WMQ,  and  LNN  stations  ICNB  and  MNV.  Most  such  tests 
were  run  at  MRC  although  some  were  also  performed  at  SMU.  At  the  .01 
significance  level,  between  90-100%  of  the  nuclear  explosions  and  quarry  blasts 
were  detected  as  outliers  of  the  earthquake  groups  in  the  various  regions.  Overall, 
209  of  229  (91%)  explosions  were  detected  and  there  were  only  2  false  alarms  out 
of  143  earthquakes  (1 .4%),  not  significantly  higher  than  the  targeted  1%.  These 
results  were  obtained  for  diverse  regions,  for  a  wide  range  of  epicentral  distances 
and  magnitudes,  and  for  single  stations  and  arrays.  The  methodology  is,  of  course, 
applicable  to  multiple  stations,  as  well. 

The  application  of  the  outlier  detection  method  to  data  from  multiple  stations  was 
explored  in  detail,  due  to  the  concern  that  some  data  compression  might  be 
required.  Various  data  compression  methods  were  considered  and  it  was  ultimately 
decided  that  with  proper  computer  code  the  so-called  "full  vector"  MLE  outlier 
method  was  preferable  to  any  compression  methods.  This  problem  is  discussed  in 
detail  in  the  paper  "Outlier  Tests  with  Multiple  Stations"  which  is  included  in  the 
appendix. 

In  general,  we  feel  this  work  has  been  very  successful  and  when  the  methodology 
we  are  currently  developing  is  complete,  we  feel  that  the  statistical  methodology 
developed  will  be  nearly  optimal  for  automated  detection  of  suspicious  events, 
while  at  the  same  time  furnishing  the  user  with  reliable  estimates  of  the  associated 
error  rates. 
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A  Bootstrap  Gleneralized  Likelihood  Ratio  Test  in  Discriminant  Analysis 
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Abstract 

A  generalized  likelihood  ratio  test  is  developed  for  classification  in  two  populations 
when  one  needs  to  control  one  of  the  probabilities  of  misclassification.  The  proposed 
classification  procedure  is  constructed  by  applying  the  parametric  bootstrap  to  the 
generalized  likelihood  ratio.  There  are  known  methods  for  controlling  this 
misclassification  probability  for  the  case  where  normal  distributions  with  the  same 
covariance  matrix  are  assumed.  Our  approach,  however,  can  be  applied  to  not  only  this 
case  but  to  the  case  of  normal  distributions  with  different  covariance  matrices  and  the 
case  of  a  mixture  of  discrete  and  continuous  variables. 

The  results  given  here  do  not  depend  on  normality  but  can,  in  fact,  be  applied  to 
any  distribution  for  which  the  maximum  likelihood  estimates  exist.  We  do,  however, 
restrict  our  simulation  of  these  results  to  the  normal  distribution  if  the  variates  are  all 
continuous.  Three  cases  are  simulated:  normal  distributions  with  equal  covariance 
matrix,  normal  distributions  with  unequal  covariance  matrices,  and  mixture  of 
categorical  and  normal  variables.  An  application  to  classifying  seismic  events  is 
presented. 

Keywords:  Bootstrap,  hypothesis  testing,  discriminant  analysis,  mixtures  of  continuous 
and  discrete  variables,  mixed  variables 
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1.  Introduction 


One  of  the  primary  problems  associated  with  monitoring  worldwide  nuclear 
proliferation  is  the  problem  of  distinguishing  seismically  between  small  earthquakes  and 
explosions.  Although  the  statistical  problem  appears  to  be  one  of  discriminant  analysis, 
it  is  actually  one  of  testing  hypotheses  since  the  political  and  physical  environment  will 
usually  require  one  of  the  errors  to  be  preassigned. 

Classical  approaches  for  discriminant  analysis  in  two  populations  depend  on  the 
ratio  of  the  probabilities  or  probability  density  functions.  The  classification  rule  based 
on  the  ratio  is  optimal  in  the  sense  that  it  minimizes  the  total  probability  of  mis- 
classification  (Welch  1939).  Under  the  assumptions  of  normality,  equal  covariances,  and 
unknown  parameters  for  the  variables,  Anderson  (1951)  derived  a  classification  rule 
based  on  the  linear  discriminant  function,  which  is  known  as  Anderson’s  W  statistic,  by 
substituting  estimates  for  the  parameters  in  the  ratio.  When  the  covariance  matrices  are 
not  equal,  replacing  each  parameter  by  its  estimate  gives  the  classical  quadratic 
discriminant  function  (Seber,  1984,  p297;  Anderson,  1984,  p235). 

Among  other  classification  rules  is  a  hypothesis-testing  approach  which  is  derived 
by  obtaining  the  generalized  likelihood  ratio.  This  rule  based  on  the  assumption  of 
normal  distributions  with  equal  covaxiance  matrices,  was  proposed  by  Anderson  (1958), 
studied  by  John  (1960,  1963),  and  has  become  known  as  John’s  Z  statistic.  Krzanowski 
(1982)  extended  this  approach  to  mixed  discrete  and  continuous  variables.  For  more 
discriminant  procedures  in  the  mixture  case,  see  Knoke  (1982),  Krzanowski  (1975,  1979, 
1980),  and  Tu  and  Han  (1982). 

Most  of  these  classical  classification  rules  allocate  the  individual  to  be  classified  to 
one  of  the  populations  if  the  ratio  is  less  than  a  cut-off  point  c,  and  to  the  other 
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otherwise.  The  cut-off  point  c  is  usually  based  on  the  probabilities  of  drawing  an 
observation  from  the  individual  populations  and  the  costs  of  misclassification. 
Associated  with  these  procedures  are  the  resulting  misclassification  probabilities. 
When,  as  in  the  problem  of  interest  here,  it  is  important  to  fix  one  of  these  probabilities 
of  misclassification,  the  statistician  will  need  to  determine  the  cut-off  point  to  allow  this 
probability  of  misclassification  to  be  prespecified. 

When  this  probability  is  prespecified  the  problem  then  becomes  one  of  testing  a 
hypothesis.  However,  because  of  the  setting  of  this  problem  we  shall  continue  to  refer 
to  it  as  a  classification  problem.  When  the  p-dimensional  characteristic  variable  V~ 
J\r  2)  for  a  population  ij,  V  ~  E)  for  another  population  and 

52  unknown,  Anderson  (1973)  and  Kanazawa  (1979)  obtained  the 

asymptotically  normal  expansion  of  the  distribution  of  statistics  W  and  Z  respectively, 
which  are  used  to  find  the  cut-off  point  for  a  fixed  value  of  the  particular  misclassi¬ 
fication  probability.  In  other  cases  (for  example  12^^^  or  V  not  normal)  the 

asymptotic  distribution  of  the  classification  statistics  is,  in  general,  unknown  so  that  no 
hypothesis  test  is  available. 

In  this  report  we  determine  a  test  of  the  classification  hypothesis  that  satisfies  the 
following  requirements; 

i)  52^*^^  is  not  necessarily  equal  to  52^^^ 

ii)  The  p-dimensional  discriminant  variable  may  be  a  mixture  of 
continuous  and  discrete  variables 

iii)  The  continuous  variables  need  not  be  normally  distributed. 

Examples  of  continuous  discriminants  that  are  commonly  used  in  the  nuclear 
monitoring  setting  are  ratios  of  amplitudes  or  spectra  for  different  time  windows  and 
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frequency  bands  of  tbe  observed  seismogram.  Earthqualces  typically  generate  more 
sbear  energy  than  compressional  energy,  while  explosions  usually  have  much  more 
compressional  energy  than  shear.  Since  compressional  waves  propagate  faster  than 
shear  elastic  waves,  this  leads  to  larger  relative  amplitudes  in  different  time  windows  for 
the  two  source  types.  Although  explosive  devices  are  expected  to  have  more  intrinsic 
high  frequency  content  than  earthquakes,  explosions  are  usually  shallower,  in  more 
anelastic  materials  than  the  deeper  earthquakes,  which  tends  to  attenuate  the  high 
frequency  content.  As  a  result,  spectral  ratios  of  particular  portions  of  the  seismograms 
axe  useful  discriminants  in  some  regions  of  the  world. 

Some  examples  of  categorical  variables  that  are  commonly  used  are  presence  of 
cepstral  peaks,  regional  seismicity  (high/low),  location  (off-shore/on-shore),  depth 
(deep/shallow),  and,  in  the  context  of  associating  mine  blasts  with  a  particular  mine, 
day  of  the  week. 

The  inability  to  treat  a  mixture  of  discrete  and  continuous  variables  rigorously  in 
this  setting  has  limited  the  application  of  many  statistical  classification  methods  in  the 
past.  This  has  led  to  rule-based  approaches  (Sereno  and  Wahl,  1993)  which  are 
somewhat  ad  hoc,  artificial  intelligence  approaches  (Baumgardt,  et  al,  1992),  or 
inappropriate  applications  of  hnear  discriminant  functions  or  chi-squared  tests.  It  is 
vital,  however,  for  monitoring  applications  that  these  issues  are  all  addressed  with 
statistical  rigor  so  that  the  error  rates  involved  have  meaning.  The  classification 
method  proposed  here  satisfactorily  addresses  this  problem  by  applying  the  bootstrap  to 
the  generalized  hkelihood  ratio.  Although  this  method  is  actually  a  test  of  hypothesis, 
it  could  just  as  well  be  used  as  a  method  for  classification  in  the  classical  sense  with  the 
bootstrap  being  used  to  determine  the  probabilities  of  misclassification.  For  additional 
discussion  of  procedures  for  classifying  seismic  events  see  Shumway(1988). 

In  Section  2,  we  discuss  the  motivation  for  the  proposed  bootstrap  Hkelihood  ratio 
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classification  procedure,  show  how  to  construct  the  bootstrap  likelihood  ratio  statistic, 
and  explain  how  to  determine  the  cut-off  point  for  a  desired  misclassification 
probability.  Section  3  is  devoted  to  the  application  of  the  procedure  to  three  cases.  In 
Example  1,  the  bootstrap  likelihood  ratio  statistic  is  shown  to  perform  almost  as  well  as 
the  statistics  W  and  ^  which  are  specifically  designed  for  Example  1,  i.e.  the  case  where 
two  normal  distributions  with  the  same  covariance  matrix  axe  considered.  The  bootstrap 
also  performs  quite  well  for  both  the  normal  case  with  different  covariance  matrices 
{Example  2)  and  the  case  of  a  mixture  of  continuous  and  discrete  variates  {Example  3), 
where,  in  either  case,  classical  classification  rules  cannot  control  the  probability  of 
misclassification  since  their  limiting  distributions  axe  unknown.  In  Example  4  we  apply 
the  results  developed  here  to  some  real  seismic  discriminant  data  and  in  Section  4  we 
present  some  concluding  remarks. 

2.  Bootstrap  Gleneralized  Likelihood  Ratio  Test  for  Classification 
2.1.  Motivation 

Let  V'  =  ( Vj,  .  .  .  ,Vp)  be  a  p-dimensional  random  vector  which  is  used  to  classify 
an  individual  into  either  population  jtq  or  population  For  i  —  0,  1,  let  /j(v  |  be 
the  probability  or  probability  density  function  of  V  evaluated  at  v,  if  v  comes  from 
population  w-,  where  0^^^  is  the  set  of  unknown  parameters.  The  components  of  V  may 
be  all  discrete,  all  continuous,  or  mixture  of  discrete  and  continuous  variables.  In  the 
mixed  variables  case,  for  example,  let  =  (Y,  X)  with  Y  =  (Yj, . . .  ,Y^  and  X  =  Xj, . 
. .  ,Xp.jt5  where  Y^,  . . .  yYj.  are  discrete  and  Xj, . . .  are  continuous.  Suppose  Y  has 
the  probability  conditional  probability  density  function  of  X  given  Y 

is  fi  x|  y(^I^x]y’  probability  density  function  of  V  in  ir-  is  given  by 
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where  ^  =  0,  1.  See  Olkin  and  Tate  (1961)  for  the  mixture  of  the 

multinomial  and  the  multivariate  normal  distributions. 

For  any  given  classification  rule,  suppose  that  the  region  is  such  that  v  € 
implies  that  v  is  classified  as  belonging  to  Further  assume  that  /?q  n  =  0. 
The  respective  probabilities  of  misclassification  are 

J^llO)  =  /  /o(v  I  »<“>)  * 

h 

/>(0|1)  =  j  /i(v  I  «('))  At, 

where  dv  =  dv-^.  .  .  dv^.  The  classical  classification  rules  obtain  the  optimal  regions  Rq 
and  i?i  based  on  /o(v  |  0^®))  //i(v  j  according  to  their  classification  principles  (such 
as  minimization  of  the  total  probability  of  misclassification,  minimization  of  the  total 
cost  of  misclassification,  maximization  of  the  posterior  probability,  minimax  classifica¬ 
tion,  etc.).  However  under  any  one  of  these  classification  principles,  neither  P(ll0)  nor 
P(0ll)  is  fixed  in  advance  at  a  certain  value,  which  here  we  desire. 


2.2.  Bootstrapping  the  Log  Likelihood  Ratio  Test  Statistic 

Suppose  we  have  the  training  samples  of  size  Nq,  and  {v^^\ 

.  .  .  ,  v^^}  of  size  Ni  from  ttq  and  ttj,  respectively.  A  new  observation  whose  value 
is  V  must  be  classified  as  from  either  Tq  or  7c^.  Now  we  employ  a  hypothesis-testing 
approach  to  classify  v.  That  is,  the  classification  of  v  is  accomplished  by  testing  the 
hypothesis 


Hn:  v,  v' 


eiTi 
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Hi:  . . . ,  G  ttq  ;  V,  •  •  •  >  ^  ’^1- 

We  use  the  generalized  likelihood  ratio  method  to  construct  a  test.  The  likelihood  of  the 
two  training  samples  is  given  by 


!(«<">,  *(')  I  vf), . . .  ,Tf ,  yl‘> . vS}>)  =  n  /o(vf >  I  «<“>)  J1  /i(v5^>  I  «<'')•  (2) 

0  1  j=l 


Consider  now  the  new  individual  v  to  be  classified.  If  this  individual  is  included 
with  the  training  sample  from  tt,,  then  an  extra  multiplying  factor 

I  v)  =  /.(y  I  «'■'*) 

must  be  incorporated  in  (2).  The  generalized  likelihood  ratio  is  therefore  either  unity  or 
given  by 


LR 


sup^^(o)  ^(1)  I  I  I  .  vSJ^\  . . . ,  vj)))  } 

SUP^^(O)  ^(1)  I  I  1  •  •  •  ’  •  •  •  ’ 


Io(gf  I  v)  I  vf .  •  •  •  >  . 

I  v)  L{e^\  I  vj®),  .  .  .  ,  vjl),  .  .  .  ,  vj))) 


(3) 


where  is  the  Maximum  Likelihood  Estimator  (MLE)  of  6^^  under  Hq  and  is  the 
MLE  of  under  Hi,  «  =  0,  1.  Now  let  A  =  log(Li2).  It  intuitively  follows  that  small 
values  of  A  provide  evidence  against  Hq  and  thus  the  generalized  likelihood  ratio  test  is 
to  reject  Hq  if  A  <  A^^,  where  Aq,  is  chosen  to  provide  a  size  a  test. 

Let  P{\  <  Aq,  I  Hq)  denote  the  size  of  the  T3rpe  I  error  and  P(A  >  Aq  [  Hi)  denote 
the  size  of  the  Type  II  error  for  a  constant  Aq.  Then  P(A  <  Aq  |  Hq)  is  the  probability 
of  misclassification  P(ll0),  and  P(X  >  Aq  |  Hi)  is  the  probability  of  misclassification 
P(0|1)  when  Pq  and  Pi  are  defined  in  terms  of  Aq.  Therefore  we  can  construct  a 
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classification  rule  which  can  control  one  of  the  probabilities  of  misclassification  by  fixing 
the  size  of  the  test  if  we  know  the  distribution  of  A(V,  .  . .  ,V^^, 

In  most  cases  it  is  difficult  to  obtain  the  exact  distribution  of  the  test  statistic  A.  The 
distribution,  however,  can  be  approximated  by  employing  the  bootstrap  method  (Efron 
1979,  1982). 

Since  the  form  of  the  probability  density  function  is  assumed  known,  the 
bootstrap  samples  can  be  obtained  from  the  estimated  density  function.  This  is  called 
the  parametric  bootstrap  (Efron  1979),  and  we  employ  it  in  this  study.  We  have 
examined  the  use  of  the  nonparametric  approach  of  resampling  with  replacement  from 
the  training  samples,  and  for  the  training  samples  of  size  25  or  larger,  this 
nonparametric  bootstrapping  yielded  similar  results  to  those  reported  here. 

The  likelihood  ratio  statistic  for  the  test  of  the  null  hypothesis  Hq  versus  the  alter¬ 
native  Hi  can  be  parametrically  bootstrapped  as  follows.  Given  the  training  samples 

bootstrap  samples  generated  randomly 

from  /o(v  I  01^°^)  and  /^(v  |  respectively,  where  and  axe  obtained  from 

the  original  samples  and  respectively.  The  value  of  A,  to  be  denoted 

A*,  is  computed  for  the  bootstrap  samples  by  substituting  {vj^  \  .  .  .  ,v^^^}, 

for  v,  . . .  ,v55)},  in  (3),  respectively.  This  process 

is  repeated  independently  B  times,  and  the  replicated  values  of  A  ,  {Aj}y_i,  evaluated 
from  the  successive  bootstrap  samples,  can  be  used  to  assess  the  true  null  distribution  of 
A.  In  particular,  the  oth  empirical  quantile  of  {A|}^1,  denoted  by  A^,  will  essentially 
approach  Aq,,  the  true  critical  value  for  the  test  of  size  a,  for  large  Nq  and  iVj  as  B 
tends  to  infinity.  (See  Bickel  and  Freedman  (1981)  for  some  asymptotic  theory  on  the 
quantile  process  for  the  bootstrap.).  Thus  we  use  A^  as  a  critical  value  for  the  test  of 
size  a.  Therefore,  we  allocate  v  to  if  A  <  A^,  and  allocate  v  to  ttq,  otherwise. 

McLachlan  (1987)  showed  the  relationship  between  A^  and  the  bootstrap 
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replication  size  B  for  the  specified  test  size  a.  In  general,  given  a  set  of  B  order 
statistics  from  a  population,  the  probability  that  a  randomly  selected  member  from  the 
population  is  less  than  or  equal  to  the  jth  order  statistic  is  j/(^B-\-l).  Thus,  if 
a  =  j/{B+l),  then  is  the  ^h  smallest  value  of  {Af}^j,  i.e.  if  a  =  0.05  and  B  =  299 
then  A^  is  the  15th  smallest  value  of  {A*}|£p 

3.  Applications 

The  bootstrap  generalized  likelihood  ratio  test  proposed  here  allows  the  p- 
dimensional  characteristic  variable  V  to  be  discrete,  continuous,  or  a  combination  of 
discrete  and  continuous  variables,  and  its  probability  or  probability  density  function 
fi(V  I  0^*^)  for  TTj  is  assumed  to  be  known  except  for  the  value  of  the  parameter  i  = 
0,  1.  It  can  therefore  be  applied  to  the  classification  problem  in  each  of  these  cases 
when  one  needs  to  control  one  of  the  probabilities  of  misclassification.  As  we  will  see, 
the  bootstrap  generalized  likelihood  ratio  test  essentially  achieves  the  required 
probability  of  misclassification  for  even  a  moderate  size  sample.  Throughout,  we  assume 
that  we  have  random  samples  {v^®^}^]_  from  ttq,  and  from  ttj^. 

In  the  following  four  examples  we  consider  four  distinct  scenarios.  In  the  first 
example  we  consider  the  simple  case  where  the  observations  are  all  normal  with  equal 
covariances.  Of  course  this  case  is  well  established,  but  we  consider  it  to  demonstrate 
that  very  little  is  lost  by  using  the  bootstrap  rather  than  the  exact  distribution.  In 
Example  2,  we  continue  to  assume  normality  but  drop  the  assumption  of  equal 
covariances.  In  this  case  the  bootstrap  is  necessary  in  order  to  determine  the  proper 
critical  point.  However,  it  is  not  necessary  to  bootstrap  the  likelihood  ratio,  but  instead 
one  could  bootstrap  the  quadratic  discriminant  function,  Q.  This  example  demonstrates 
that  these  two  bootstrap  approaches  yield  essentially  the  same  result.  In  Example  3  we 
consider  a  mixture  of  normal  and  binomial  variates  where,  to  our  knowledge,  no  alter- 
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native  to  the  method  introduced  here  is  available.  Finally,  in  Example  4  we  consider  a 
set  of  real  data  which  is  treated  as  a  mixture  of  normal  and  multinomial  data. 

Example  1:  Normal  Distributions  with  Equal  Covariance  Matrix 

Suppose  that  /j(v  |  6^*^)  is  the  density  function  for  with 

(=  E),  where  ,E).  Replacing  the  unknown  parameters  in  /0(v  |  {iP\  E))//2(v  [ 

{lP\  E))  by  their  estimates  leads  to  the  well-known  Anderson’s  W  statistic  (A2)  given 
in  the  appendix.  The  Ukelihood  ratio  (A3)  is  characterized  by  John  s  Z  statistic  (A4). 
On  the  other  hand,  the  log  likelihood  ratio  statistic,  A,  is  given  in  (A4)  and  is  obtained 
directly  by  taking  the  log  of  the  expression  (A3)  and  dividing  it  by  a  constant.  The 
monotonic  relationship  between  Z  and  A  is  obvious.  If  the  values  of  W,  Z.,  and  A  are 
greater  than  their  cut-off  points,  then  is  favored  for  v,  and  is  preferred  otherwise. 

Now  we  want  to  choose  the  cut-off  point  so  that  one  probability  of 
misclassification  is  controlled.  Let  oc  be  the  desired  P(1|0).  Anderson  (19^3)  has 
obtained  from  the  asymptotic  normal  distribution  of  W,  the  following  approximate  cut¬ 
off  point  Wq,  which  attains  the  desired  probability  a  to  within  0(iv2).  For  large  Nq 
and 

where  N=Nq  +  N^-2,  D  =  ^|  Uq  is  such  that  $(tto)  =  o:, 

and  ^  (•)  is  the  cumulative  JV(0,  1)  distribution  function.  Kanazawa- (1979)  has 
obtained  the  asymptotic  cut-off  point  Z^  for  the  Z  statistic.  For  large  Nq  and  N■^^, 

Za==  lr>^  +  D[uQ  +  ^{ul  +  Dv^-{p-l)) 
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where  D  and  uq  axe  the  same  as  above. 

Instead  of  deriving  the  limiting  distribution,  the  cut-off  point  of  the  bootstrap 
log  likelihood  ratio  statistic  A  is  obtained  by  the  parametric  bootstrap  procedure 
described  in  Section  2.2.  Using  the  MLEs  of  /i^^\  and  S  from  the  training  samples 
{vW}j^l,  i  =  0,  1,  bootstrap  samples  are  generated  from  a 

Np(v(®),  A/{Nq  +  Ni))  and  a  N,(v^^\  A/(iVo  +  ^i))^  respectively  where  A  is  defined  in 
the  appendix.  We  compute  the  value  of  the  log  likelihood  ratio  statistic,  A 
corresponding  to  (A4),  for  the  bootstrap  samples  by  replacing  v,  v^^^,  v^^\  S  by 
•y*(0)^  respectively,  where  v*^^^  =  *=0?  U  and  S*  is  calculated 

according  to  (Al)  for  the  bootstrap  samples.  This  process  is  repeated  independently  B 
times.  Then  A^  is  the  ath  empirical  quantile  of  where  are  the  values 

of  A*  evaluated  from  the  successive  bootstrap  samples. 

For  given  a,  let  Pi.f;{l|0),  P^1|0)  and  P;^(ll0)  be  the  probabilities  that  the  new 
individual  is  misclassified  into  by  the  statistics  W,  2^  and  A  using  the  cut-off  points 
Wa,  Zee,  Aa,  respectively.  Then  P]/p{l\0)  =  P{W  <  Wq-I^tq),  P^CllO)  =  P{Z  <  ^ako); 
and  P;^(1|0)  =  P(A  <  A^  |  ttq).  We  will  examine  how  close  Ppy{ll0),  P^illO)  and 
Pj^(l|0)  are  to  the  desired  misclassification  probability,  a  =  P(1|0),  for  the  normal 
distributions  with  equal  covariance  matrix  by  Monte  Caxlo  method.  We  generate  two 
sets  of  random  samples  from  S)  and 

S),  respectively,  where 


For  each  i  =  1,  2,  .  .  .  ,  M,  we  obtain  the  values  of  the  statistics  W,  Z,  \,  say  W^,  Z^, 
A^,  using  {Vj,  {v(p}^l},  and  compare  them  to  their  corresponding  critical 
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values  Z^^,,  for  a  fixed  a.  B  =  499  bootstrap  samples  are  used  for  Af^. 

Then  Pp^l|0),  P^l\^)  and  P;^(1|0)  are  estimated  by  the  proportion  of  times  that  the 
value  of  the  statistic  is  less  than  or  equal  to  its  critical  value  among  M  trials.  Since 
P{^1|0)  is  the  usual  estimate  of  a  proportion,  its  standard  deviation  (s.d.)  is  estimated 
by^j  Pj^l|0)(l  —  Pf^l[0))/M.  The  standard  deviation  estimates  of  P^1|0)  and  Pj^(l|0) 
are  obtained  similarly.  The  first  portion  of  Table  1  shows  the  estimates  of  the 
probability  of  misclassification  with  their  standard  deviations  (s.d.)  for  the  different 
sample  sizes  with  oc  =  0.05,  M  =  10,000.  The  results  for  Pj^l|0)  and  P ^\\Qi)  are 
identical  when  IVq  =  JVj  =  25  since  .^  =  (lVQ/(iVo  +  1))  W^for  JVq  =  Although  for  the 
sample  sizes  considered,  the  bootstrap  estimate  does  not  attain  the  same  precision  as 
the  W  ox  Z  statistic’s  estimate,  it  is  clearly  competitive. 


Table  1.  The  estimates  of  the  probability  of  misclassification,  P(1|0)  =  0.05, 
and  the  estimates  of  the  power,  P(l|l) 


iVo  =  =  25 

0.054 

0.054 

.  0.061 

(0.002) 

(0.002) 

(0.002) 

II 

II 

0.055 

0.055 

0.060 

(0.002) 

(0.002) 

(0.002) 

Nq  =  30,  =  45 

0.726 

0.725 

0.736 

(0.004) 

(0.004) 

(0.004) 
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Now  we  compare  the  powers,  P(l|l),  for  W,  Z  and  A.  Random  samples 

generated  from  S)  and  S), 

■  respectively  with  the  same  parameters  as  above.  The  power  estimates  for  W,  Z  and  A, 
Ppj;(lll),  and  P;^(lll),  axe  obtained  in  the  same  way  as  for  P^1|0), 

and  P^(llO),  respectively.  For  a  =  0.05,  Nq  =  30,  =  45,  M  =  10,000  and  B  =  499, 

the  power  estimates  are  similar  to  each  other  with  the  bootstrap  being  slightly  better 
(undoubtedly,  due  to  the  slightly  larger  critical  region)  as  shown  in  the  second  portion 
of  Table  1. 

Example  2:  Normal  Distributions  with  Unequal  Covariance  Matrices 

Let  ttq  and  be  and  with  ^  and 

^  When  the  parameters  are  unknown,  a  classical  classification  rule  known  as  the 
quadratic  discriminant  function  is  obtained  by  taking  the  log  after  substituting 
estimates,  v^®\  v^^^,  and  of  and  into  the  ratio  of  the  two 

multivariate  normal  probability  density  functions,  /q(v  /fi(v  | 

The  quadratic  discriminant  function  Q  is  given  in  (A5),  and  v  is  classified  to  Xq  if  ^  > 
0  and  to  otherwise.  The  probabilities  of  misclassification  of  Q  are  difficult  to  control 
since  even  its  limiting  distribution  is  unknown. 

Following  the  hypothesis-testing  approach  of  (2),  the  MLEs  of 
under  Hq  and  are  given  in  the  Appendix.  The  log  likelihood  ratio  statistic,  A,  is 
given  in  (A6),  and  to  evaluate  the  cut-off  point  A^,  for  the  desired  probability  of 
misclassification,  P(llO)  =  a,  we  generate  bootstrap  samples 

from  a  Nj,  (v^®\  ^  Np(v^^^,  respectively.  Following  the  same 

bootstrap  procedure  as  in  Example  1,  the  ath  empirical  quantile  A^  is  obtained  from  the 
values  of  the  log  likelihood  ratio  statistic  A  for  the  successive  bootstrap  samples.  The 
bootstrap  generalized  likelihood  ratio  classification  rule  with  misclassification 
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probability  P(1|0)  =  a  is,  therefore,  to  assign  v  to  if  A(v)  <  A^,  and  to  Xq, 
otherwise. 

Consider  two  bivariate  normal  distributions  where 


Suppose  we  apply  the  Q  statistic  for  classification  using  the  usual  classification  rule,  i.e. 
V  is  classified  to  ttq  if  Q  >  0  and  to  ttj  otherwise.  The  probability  of  misclassification  of 
interest,  i.e.  Pq(1|0)  is  P(Q  <  0  Ittq).  In  order  to  determine  the  probability  of  this 
classification  error  we  conduct  a  simulation.  We  generate  {vj,  and 

N2(/t^^^,  respectively.  We  obtain  the  Q 

statistics  for  {vj,  {vW}yJi},  z  =  1,  2,  .  .  .  ,  M,  and  denote  these  Qp  Q2,  ■  •  ■  , 

Qj^.  Then  Pq(1|0)  is  estimated  by  Pq(IIO)  which  is  the  proportion  of  values  that  are 
less  than  or  equal  to  zero.  Pq(1|0)  (with  its  standard  deviation)  is  0.274  (0.004)  for  Nq 
=  100,  Ni  =  150  and  M  =  10,000.  When  it  is  important  to  keep  the  probability  of 
misclassification  Pq(110)  small,  an  error  this  large  may  be  unacceptable,  resulting  in  the 
need  for  the  method  we  axe  describing. 

Now  we  consider  the  log  likelihood  ratio  statistic  A.  First,  we  would  like  to  know 
how  well  the  parametric  bootstrap  procedure  approximates  the  true  null  distribution  of 
A.  Since  the  true  null  distribution  of  A  is  not  known,  we  generate  samples  {Vj, 
{vW}"0j}gj  from  NjC/"*,  S<“>)  mi  S<'))  with  M  = 

100,000.  Applying  {vj,  {yW  )^i,  to  (A6),  we  can  obtain  The 

true  ntdl  cumulative  distribution  function  (cdf)  of  A  is  approximated  by  the  empirical 
cdf  using  {A  Jgi  for  {Nq,  N^)  =  (10,  15),  (%  ^1)  =  (30,45),  and  {Nq,  N^)  =  (100,  150). 
The  true  critical  value  A^^  is  approximated  by  -1.900,  -1.504,  -1.353  respectively.  These 
are  the  ath  quantiles  of  {Aj}^l  where  a  =  0.05  for  {Nq,  N^)  =  (10,  15),  {Nq,  N^)  =  (30, 
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45),  and  (JVq,  N{)  =  (100,  150),  respectively.  In  this  simulation,  B  =  299  is  used  for 
the  bootstrap  replication  size  because  of  computer-time  constraints.  Our  investigation 
indicates  that  the  results  using  B  =  299  and  B  =  499  are  similar. 

For  a  set  of  random  samples  {v,  under  Hq  with  {Nq,  N-^)  = 

(10,  15),  {Nq^  N{)  =  (30,  45),  and  {Nq,  N^)  =  (100,  150),  the  empirical  null  distribution 
of  the  bootstrap  log  likelihood  statistic  using  with  B  =  299  is  also  plotted 

around  the  true  null  cdf  in  Figure  1.  Inspection  of  this  figure  shows  that  the  bootstrap 
null  distribution  approximates  the  true  null  distribution  of  the  log  likelihood  ratio  stat¬ 
istic  quite  well  as  the  sample  sizes  increase  and  does  surprisingly  well  for  small  samples. 

Even  though  the  null  distribution  of  the  Q  statistic  is  unknown,  the  cut-off  point, 
Qa>  for  misclassification  probability,  P(1|0)  =  a,  can  be  approximated  by  the  same 
parametric  bootstrap  procedure  as  for  A.  That  is,  we  evaluate  the  Q  statistic  for  B 
successive  bootstrap  samples  and  call  them  . .  .  ,0%.  Then  is  approximated 

by  Qa,  the  ath  empirical  quantile  of  Therefore,  one  can  allocate  v  to  if  Q 

<  Qa,  and  allocate  v  to  ttq,  otherwise. 

With  the  same  simulation  data  used  to  get  Pq(1|0)  =  0.274  above,  P (^^(llO)  (s.d.), 
the  estimate  of  a  fixed  P(1|0)  =  0.05  by  the  parametrically  bootstrapped  Q  statistic  QB, 
is  0.050  (0.002)  for  B  =  499.  Pj^(llO)  (s.d.)  of  the  bootstrapped  A,  i.e.  A*,  is  0.049 
(0.002)  for  the  same  bootstrap  samples  as  for  QB.  Both  bootstrap  estimates  are  close  to 
the  true  fixed  misclassification  probability  P(ll0)  =  0.05. 

To  further  compare  the  two  tests  we  now  investigate  their  respective  powers, 
P(l|l),  for  different  parameter  values.  Consider  bivariate  normal  distributions 
S^®^)  and  N2(/x^^\  Let  pq  and  pi  be  the  correlation  coefficient  for  N2(/i^*^^,  B^®^) 

and  B^^^)  respectively.  We  assume  that  Pq  =  0.5,  pi  —  -0.5  and  that  both 

distributions  have  the  same  marginal  variances,  0^  =  1  and  <T2  =  1.  That  is. 
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For  =  (0,  0)',  we  examine  the  power,  P(l|l)  of  the  bootstrap  Q  statistic  and  the 
bootstrap  A  statistic  at  +  A((Ti,  03)',  A  =  1,  2,  3,  for  small  samples  (Nq  = 

10,  =  15)  and  for  large  samples  (Nq  =  100,  iVj  =  150).  For  each  A  =  1,  2,  3  under 

Hj,  we  randomly  generate  from  and  {v,-, 

from  N'2(fi^^\  S^^))  with  ^o  =  10^  iV^  =  15  and  M=  10,000.  For  each  i  =  1,  .  .  .  ,Mand 
for  a  =  0.05,  {v-,  {v^}^!,  is  used  for  the  parametric  bootstrap  to  obtain  the 

cut-off  points  Qa  and  A^  for  QB  and  A,  respectively.  The  bootstrap  replication  size  B 
used  here  is  499.  Then  the  power  estimate  Pq5(1|1)  for  QB  is  the  proportion  of  times 
that  the  Q  statistic  value  is  less  than  or  equal  to  Qa  out  of  M  trials.  The  power 
estimate  •P;^(l|l)  for  the  bootstrap  A  is  obtained  similarly.  Pqb{'^\1)  and  -Pj^Clll)  are 
listed  along  with  those  for  large  samples  (IVq  =  100,  N-^  =  150)  in  Table  2. 


Table  2.  Power  comparison  between  the  bootstrap  A  and  the  bootstrap  Q  ( QB)  with 
B  =  499.  Entry  is  power  estimate  with  its  standard  deviation. 


A  =  1 

A  =  2 

A  =3 

0.310  (0.0046) 

0.302  (0.0046) 

=  10,  =  15 

0.815  (0.0039) 

0.795  (0.0040) 

0.992  (0.0009) 

0.990  (0.0010) 

p^m 

N,-. 

0.375  (0.0048) 

0.376  (0.0048) 

=  100,  =  150 

0.884  (0.0032) 

0.884  (0.0032) 

0.999  (0.0003) 

0.999  (0.0003) 

16 


In  this  simulation,  the  bootstrap  A  has  slightly  higher  power  than  the  bootstrap  Q  for 
small  samples,  but  there  is  little  difference  for  large  samples. 

Example  3:  Mixture  of  Categorical  and  Continuous  Variables 

In  this  example  we  consider  a  mixture  of  continuous  and  discrete  variates.  Of  the 
discriminant  functions  in  the  previous  sections,  only  the  A  statistic  has  been  studied  for 
this  case.  Suppose  the  variable  V  is  a  mixture  of  discrete  and  continuous  variables.  Let 
V'  =  {Z,  X)  with  Z=  {Z^,  ,  Zr)  and  X  =  (Xj,  .  .  .  ,  Xg)  where  Z^,  .  .  .  ,Zr  are 

discrete  and  X^,  .  .  .  ,  Xg  are  continuous,  r  and  q  are  positive  integers.  Suppose  further 
the  yth  discrete  variable  Zj  has  k-  categories,  j  =  1,  .  .  .  ,  r.  Then  the  vector  of  discrete 
variables  Z  may  be  expressed  as  a  multinomial  random  variable  =  {Y^,  ...  , 
where  Yj^  =  0  or  1,  m  =1,  .  .  .  ,  k,  12^=1  ^  “  nj=:i  Thus,  each 

distinct  pattern  of  Z  defines  a  multinomial  cell  of  Y  uniquely.  It  is  assumed  that  the 
probability  of  obtaining  an  observation  in  cell  m  for  tt,  is  p^^,  (0  <  p^  <  Ij  m=l 
p^  =  1),  i  =  0,  1.  Then  the  joint  probability  density  function  of  V  in  x- is  given  by  (1), 
where  parameters  of  X  given  Y. 

For  the  population  x^,  the  conditional  pdf  of  X  given  Y,  I  t>e  of 

any  proper  type  depending  on  the  relationship  between  X  and  Y.  Following  Olkin  and 
Tate  (1961),  for  this  example  we  assrime  that  X  has  a  conditional  multivariate  normal 
distribution  with  mean  given  Y  belonging  to  cell  m  and  common  covariance  matrix 
in  all  cells.  K  Y  belongs  to  cell  m,  i.e.,  if  Y  =  (Yi^-",Y„,_i  Y„,  Y,„  +  i;*-,Yjt )  =  (0, 
...,  0,  1,  0,  -,  0),  then  /,■  y(Y  |  0^)  and  /.  X|y(^I  ^]y’  of  (1)  are  given  as  follows; 
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Let  the  ^h  member  of  the  training  sample,  •  .  •  >  v^J  from  tt-  be 

denoted  by  =  (yj*\  xj*^)},  where  is  the  vector  of  binary  variables  obtained 

from  the  discrete  components  z  of  vj*^,  and  xj*^  is  the  vector  of  continuous  variables.  Let 
denote  the  number  of  individuals  of  the  training  sample  from  tt,  that  fall  in  cell  m 
defined  by  Y.  Then  N-  =  i  =  0,  1.  The  likelihood  of  the  two  training 

samples  is  given  by  _ 

n  [{n  (?'»?)"”  }{(2’')’is<''’i}'^ 

2=0  m—l 

■  5  £  (==$■’  -  -  '‘v)}i  >  («) 

j=i 

where  takes  the  value  if  y^*^  falls  in  the  mth  cell,  m  =1,.  .  .,k. 

Consider  now  the  new  individual  v  to  be  classified,  and  suppose  that  the 
discrete  components  place  it  into  cell  1.  K  this  individual  is  included  with  the  training 
sample  from  x j,  then  an  extra  multiplying  factor 


must  be  incorporated  in  (4)  to  construct  the  generalized  likelihood  ratio  test  statistic  of 
(3).  xj*^  must  belong  to  one  of  k  subgroups  corresponding  to  the  conditional 
distributions  depending  on  the  value  of  yj’^  for  j  =  1,  .  .  .  ,  iV^,  i  =  0,  1.  Let  be  the 
sth  member  of  mth  subgroup  of  the  continuous  variable  measurements  whose  discrete 
covariates  fall  in  the  mth  cell.  Then  any  element  of  belongs  to  one  of  k 

subgroups  where,  of  course,  some  of  the  could  be  zero.  Hence  we 

can  rewrite  the  exponent  of  (4)  as 
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The  MLEs  under  Hq  and  are  given  in  the  appendix,  and  the  log  likelihood  ratio 
statistic  is  given  in  (A7). 

Krzanowski  (1982)  considered  a  similar  likelihood  ratio  statistic  when 
(=  S),  and  and  S  are  estimated  by  a  second-order  regression  model  of  X  on  Y. 
Then  he  allocated  a  new  individual  to  ttq  if  his  likelihood  ratio  statistic  is  greater  than 
or  equal  to  1  and  to  otherwise.  He  did  not  consider  the  problem  when  it  is  desired  to 
control  one  of  the  misclassification  errors. 

We  investigate  the  performance  of  the  bootstrap  log  likelihood  ratio  test  by 
examining  the  power  with  a  simulation.  We  consider  a  simple  situation  in  which  we 
have  a  discrete  variable  from  a  Bernoulli(p)  distribution  and  an  independent  continuous 
variable  distributed  N{fi,  a^).  For  i  =  0,  1,  let  {v^*^  =  ^  random 

sample  from  tt,-,  where  ~  Bernoulli(pj)  and  ~  cr^).  Let  v  =  (z,  x)'  be  a 

new  observation  to  be  classified  where  z  ~  Bernoulli(p]^)  and  x  ~  cr^). 

We  examine  the  power  of  the  bootstrap  A,  P^(l|l),  for  different  parameter  values. 
We  set  Pq  =  0.1,  /Xq  =  0,  <tq  =  0.5,  and  cr^  =  1,  For  p^  =  0.9,  0.7,  and  0.5,  the  estimate 
of  is  obtained  for  juj  =  0.5  -1-  Actj  where  A  =  {0,  0.5,  1,  1.5,  2,  2.5,  3}.  The 

power  estimate,  Pj^(l|l),  is  the  proportion  of  times  that  the  A  statistic  value  is  less  than 
or  equal  to  A^  out  of  2000  trials,  where  A^  is  the  bootstrap  cut-off  point  at  a 
significance  level.  With  Nq-=  N^-  50,  B  =  299,  and  a  =  0.05,  these  power  estimates 
are  plotted  in  Figure  2.  As  the  separation  between  and  increases,  the  power  of  the 
bootstrap  likelihood  ratio  test  increases.  The  plot  also  shows  that  the  larger  differences 
between  Pq  and  p^  produces  the  better  power  curves.  Simulations  were  also  performed 
to  verify  the  significance  level  of  the  test.  The  results  were  good  and  essentially  the 
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same  as  Table  1. 


Example  4'  A  Real  Data  Example 

Unfortunately  no  suitable  imclassified  data  with  categorical  variables  comparing 
nuclear  explosions  to  earthquakes  are  available  for  this  report.  However,  there  is  a 
considerable  amount  of  mining  explosion  data  available  as  training  data.  Therefore,  to 
illustrate  the  method  developed  here,  we  have  applied  the  bootstrap  generalized 
likelihood  ratio  test  to  observations  at  the  ARCESS  seismic  array  in  Norway  which 
consist  of  mining  blasts  from  two  separate  mines  (HB6  and  HD9)  located  in  the  Kola 
Peninsula  of  the  former  Soviet  Union.  (For  other  applications  of  the  bootstrap 
generalized  likelihood  ratio  test  to  seismic  event  identification,  see  Fisk  and  Gray 
(1993);  Fisk  et  al.,  (1993).)  Fifteen  blasts  were  observed  from  mine  HB6  and  sixteen 
blasts  were  observed  from  mine  HD9. 

’  The  variables  used  here  are  day-of-the-week  (DOW),  slowness  (inverse  group 
velocity  measured  in  seconds/degree)  of  Pn  (SLOW),  and  rectilinearity  of  Pn  (RECT). 
Pn  is  typically  the  first  prominent  portion  of  the  seismogram  to  arrive  for  signals 
observed  at  regional  distances  (<2000  km).  These  data  are  part  of  a  data  set 
established  by  Sereno  and  Patnaik  (1992)  as  a  testbed  for  seismic  signal  identification 
problems.  Other  features  are  also  available  in  this  data  set,  but  most  have  many 
missing  data  values,  a  problem  we  are  currently  addressing. 

A  histogram  plot  of  DOW  is  plotted  in  Figure  3  for  the  two  sets  of  mining  blasts. 
Note  that  the  HD9  blasts  occur  predominantly  on  day  6,  while  the  HB6  blasts  occur 
more  uniformly  throughout  the  week.  Dot  plots  of  the  continuous  variables  are  shown 
in  Figure  4.  SLOW  exhibits  relatively  good  separation,  while  there  is  considerable 
overlap  for  RECT. 
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In  order  to  assess  the  value  of  the  discrete  variable  we  considered  cases  in  which 
DOW  is  either  included  or  excluded.  Since  the  day  on  which  an  event  occurred  has  no 
influence  on  the  seismogram,  we  treated  the  continuous  variables  as  independent  of 
DOW.  Furthermore,  we  assumed  unequal  covariance  matrices  since  the  variances  for 
SLOW  are  significantly  different.  Setting  the  significance  level  at  0.01  and  0.05,  we 
estimated  the  power  using  the  bootstrap  with  and  without  DOW.  Table  3  gives  the 
results  using  both  continuous  variables,  while  Table  4  gives  the  results  using  only 
RECT,  with  and  without  DOW.  Since  SLOW  is  such  a  strong  discriminator.  Table  4 
better  demonstrates  the  power  that  may  be  gained  by  making  use  of  an  available 
discrete  feature. 

Table  3.  Bootstrap  estimates  of  power  using  both  SLOW  and  RECT. 


Sicmificance 

DOW  excluded 

DOW  included 

0.01 

0.962 

0.982 

0.05 

0.980 

0.986 

Table  4.  Bootstrap  estimates  of  power  using  RECT. 


Siffnificance 

DOW  excluded 

DOW  included 

0.01 

0.266 

0.377 

0.05 

0.529 

0.736 

The  power  was  estimated  in  these  tables  using  a  parametric  bootstrap  approach. 
Specifically,  given  the  training  samples  of  size  Nq  =  15  and  =  16  available  from  the 
two  mines,  ttq  =  HB6  and  Xj  =  HD9,  ML  estimates  of  the  associated  parameters  axe 
obtained.  For  these  data,  the  bootstrap  is  used  to  estimate  the  a-level  critical  value  by 
simulating  B  =  499  replications.  Each  replication  consists  of  training  samples  of  sizes 
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Nq  and  Ni  from  the  models  fit  to  ttq  and  tt^  along  with  an  observation  to  be  classified 
which  is  generated  according  to  the  model  for  ttq.  As  in  the  previous  examples,  the  or- 
level  critical  value,  A^,  was  obtained  from  the  likelihood  ratio  statistics  calculated  from 
these  replicates.  The  power  is  then  estimated  by  again  simulating  B  bootstrap 
replications,  where  each  replicate  consists  of  training  samples  of  sizes  Nq  and  N-^  from 
the  models  fit  to  ttq  and  along  with  an  observation  to  be  classified  which  this  time  is 
generated  according  to  the  model  for  The  power  is  estimated  as  the  proportion  of 
the  resulting  B  likelihood  ratio  statistics  that  are  less  than  or  equal  to  A^.  A  cross- 
validation  procedure  was  also  considered,  and  it  gave  results  similar  to  those  shown 
here.  Efron  (1983)  has  suggested  an  alternative  bootstrap  approach  to  remove  the  bias 
from  the  cross-validation  estimate. 

4.  Concluding  Remarks 

When  one  needs  to  classify  an  individual  with  one  of  the  misclassification 
probabilities  under  control  but  does  not  know  the  exact  or  limiting  distribution  of  the 
statistic  for  classification,  the  bootstrap  likelihood  ratio  method  is  shown  to  be  useful. 
The  statistic  used  for  classification  is  derived  from  the  likelihood  ratio,  and  its  limiting 
distribution  furnishing  the  discriminant  cut-off  point  is  approximated  successfully  by 
the  parametric  bootstrap. 

The  bootstrap  likelihood  ratio  statistic  is  shown  to  compete  well  with  the 
statistics  W  and  Z  whose  limiting  distributions  are  known,  for  moderate  sample  sizes 
when  two  multivariate  normal  distributions  with  equal  covariance  matrices  are 
considered.  It  also  performs  quite  well  for  both  the  multivariate  normal  case  with 
unequal  covariance  matrices  and  the  case  of  a  mixture  of  binary  and  normal  variates, 
where  classical  classification  rules  cannot  control  the  probability  of  misclassification. 
Moreover,  the  methodology  considered  here  can  be  applied  to  any  non-normal  discrete 


22 


or  continuous  variable,  and  to  any  mixture  of  continuous  and  discrete  variables, 
whenever  the  MLEs  exist.  It  should  be  noted  that  the  precision  of  the  test  depends  on 
the  sample  sizes  Nq  and  iVj,  and  the  bootstrap  replication  size  B.  Small  sample  sizes 
may  result  in  MLEs  for  the  parametric  bootstrap  which  are  not  close  to  the  true 
parameter  values.  Adequate  sample  sizes  for  different  dimensions  of  the  classification 
variable  may  need  to  be  studied.  Finally,  it  should  be  noted  that  the  method  applied 
here  could  be  applied  to  any  test  of  hypothesis  based  on  the  generalized  likelihood  ratio. 
Actually,  the  approach  considered  here  of  calculating  A  based  on  normal  likelihoods  and 
finding  A^,  should  be  a  sensible  approach  for  continuous,  unimodal  distributions.  The 
robustness  of  this  procedure  is  the  topic  of  current  research. 


Appendix:  Formulas  Related  to  Examples 


Example  1 

is  estimated  by  v^*)  =  and  is  estimated  by 


Nq  +  N^-2 


(Al) 


where  = 
is  given  by 


— (v^—  v^*^)7(^i  — l)i  i  =  Oj  1-  Anderson’s  W  statistic 


(A2) 


Under  the  null  hypothesis  Hq, 


the  MLEs  of  and  E  are 


=  (iVoV^®)  +  y)/{Nq  +  1), 
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-  iVj  +  AT;  +  1 


y(0))(v  _7(0))' 


? 


where  A  =  (r^-yW)  (yW  -y(9)' =  (Ar„  +  Wj -2)S.  Under  the 

alternative  hypothesis  the  MLEs  of  the  parameters  are 


=vW 

+  v)/(i\ri  + 1), 

In  this  case  the  likelihood  ratio  given  in  (3),  with  ^q)’  ^q)’ 

=  El),  and  Ej)  is,  therefore, 


^  ^  1 
^0  =  Nq  +  N^  +  1 


N  +  ^  (v  -  v(®))'S'^(v-v(®)) 


-/iVo+Ari+l)/2 


? 


(A3) 


where  N=:  Nq  +  N-^—  2.  The  likelihood  ratio  (A3)  is  characterized  by  John’s  Z  statistic, 


j5^(y-y(‘))'s-‘(y-y('>)- 


_,(»)) 


Thus 
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A  =  log  {N+  (v  -  S'l  (v 

{^+  iV^  (A4) 


Example  2 

The  quadratic  discriminant  function  is  given  by 


Q  =  1  log  (i^)  +  I  [(v  - 


_{v  (v  ]  . 


(A5) 


The  MLEs  of  under  Hq  are 


Af  =  +  v)/{N^  +  1), 


=  v(l) 


Ao  +  1 


where  =  T,^i  -  v^*^)  (vj*^  -  i  =  0,  1.  Under  the  alternative  hypothesis 
Hi,  the  MLEs  are 

a1«>  =  v(»), 
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+  t)/(«i  +  1), 


£  i‘’ = siVi  k'’ + 5^  (''  - 


The  log  likelihood  ratio  statistic  is  given  by 


-(A-„  +  l)log{(«'„-l)  +  5^(v-v(»))'(s(“)rl(T-v(»))}]  +C{No,N^)  (A6) 


where  —  1),  i  =0,  1,  and 


C(iVo,iVi)=log 


(iV,_l)(^i  +  l-P)/2  ^^NoP/2  +  + 


Example  3 


We  consider  the  log  likelihood  ratio  statistic  under  the  scenario  discussed  in  Example  3, 
i.e.  the  new  individual  v  to  be  classified  has  discrete  components  that  place  it  into  cell 
1.  The  likelihood  functions  on  the  numerator  and  denominator  of  (3)  are  given  by 
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+  ,  *  =  0>  1- 


Under  Hq  the  MLEs  of  /x^,  are 


pS  =  ^SSV(^o  +  1)’  ^  =1,  .  •  •  ,  ^-1,  ^+1,  ...,k, 

P?'^  ~  +  l)/(-^0  +  ^)’ 

PmJ  =  m  =1,  .  .  .  ,  /-I,  l+l,  .  .  .  ,k, 


P  i°(?  =  x)/(nj®)  +1)  , 


Nq+I 


M 


+ 


jvjO) 


ivj^)  4-  1 


(x  —  x^®^)(x  — 


Pml  =  ^mV^v  m=l,...,k, 
Pml  =  m=l,...,k, 


Ei«  =  iA('), 


where  x^  =  E^i4!l/^S^^.  =  E^i(45Ji  -xS^^)(xl*i-xS^V,  m  =1,. 

=  Em=l  Am-  Under  the  alternative  hypothesis  the  MLEs  are 


.,  k,  and 
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=  ”iSV«o. 


/•SI  =  ro  =  1, .  . .  ,  i, 


Ef  =  ±  A«»  , 


=  +  1),  m  =  1,  1,  (+1,  ...  ,  i, 


#(})  =  (41)+  l)/(Ni  +  1), 


ml  =  Tn=  1,  .  .  .  ,  l-l,  1+  1,  .  .  .  ,  k, 


+  1)  , 


^ (^)  —  1 

~  N.+l 


a(i)  + 


iV^/^  +  1 


(x  — xP)(x  — 


Since  the  exponential  term  of  after  replacing  the  parameters  by  their  MLEs,  is 
exp{  —  (l/2)q(JVQ  +  iVj  4-  1)}  for  i  =  0,  1,  the  log  likelihood  ratio  statistic  is  given  by 
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LAMBDA 

Figure  1(a).  Plots  of  distribution  functions.  Solid  line;  true  null  distribution  of  A,.  Broken  line:  empirical  null 
distribution  of  X.  No  =  10.  N i  =  1 5. 
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of  distribution  functions.  Solid  line:  true  null  distribution  of  X.  Broken 


Figure  2.  Power  curves  of  bootstrap  X  with  mixed  binary  and  continuous 
variables,  po  =  0.1.  DELTA  denotes  A. 
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Figure  3.  Histogram  plot  of  the  categorical  variable  day-of-week 
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A  HYPOTHESIS-TESTING  APPROACH  TO  DISCRIMINANT  ANALYSIS 
WITH  MIXED  CATEGORICAL  AND  CONTINUOUS 
VARIABLES  WHEN  DATA  ARE  MISSING 


James  W.  Miller,  Wayne  A.  Woodward,  and  Henry  L.  Gray 


ABSTRACT 

In  this  paper  we  consider  the  problem  of  discriminant  analysis  with 
discrete  (categorical)  and  continuous  variables  with  data  missing  at 
random.  We  use  a  hypothesis-testing  approach  based  on  the  generalized 
likelihood  ratio  as  proposed  by  Baek,  et  al.  (1994).  We  use  bootstrapping 
to  determine  critical  values  in  order  to  control  the  Type  I  error  rate.  We 
present  three  algorithms  for  dealing  with  this  case,  each  assuming  a 
different  model  for  the  data;  (1)  TLe  INDICATOR  algorithm  replaces 
categorical  variables  with  indicator  variables,  and  treats  these  as  if  they 
were  continuous;  (2)  the  FULL  algorithm  assumes  a  multinomial 
distribution  for  the  discrete  part,  and  a  multivariate  normal  distribution 
(with  mean  and  covariances  depending  on  the  discrete  part)  as  the 
conditional  distribution  of  the  continuous  part  given  the  discrete  part;  and 
(3)  the  COMMON  algorithm  assumes  a  multinomial  distribution  for  the 
discrete  part,  and  a  multivariate  normal  distribution  (with  only  the  means 
depending  on  the  discrete  part)  as  the  conditional  distribution  of  the 
continuous  part  given  the  discrete  part.  (That  is,  a  common  covariance 
matrix  is  assumed  across  all  multinomial  cells.)  The  performance  of  these 
algorithms  is  compared  through  a  simulation  study.  While  the 
INDICATOR  algorithm  seems  to  have  highest  power,  it  also  tends  to 
display  a  higher  Type  I  error  rate  than  desired.  The  FULL  and  the 
COMMON  algorithms  have  very  similar  power,  but  the  COMMON 
algorithm  appears  to  control  the  Type  I  error  rate  most  effectively,  and  is 
least  susceptible  to  problems  occurring  when  some  multinomial  cells  are 
sparsely  represented. 
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1  Introduction 


In  Baek,  Gray,  Woodward,  Miller,  and  Fisk  (1994)  (subsequently  abbreviated 
BGWMF)  techniques  are  given  for  a  hypothesis-testing  approach  to  discriminant  analysis 
in  which  one  wishes  to  control  one  of  the  probabilities  of  misclassification.  Methods  are 
presented  for  continuous  variables  only,  as  well  as  for  a  mixture  of  continuous  and 
categorical  variables.  Essentially,  the  hypothesis-testing  approach  based  on  the  ratio  of 
maximized  likelihood  functions  proposed  by  Krzanowski  (1982)  is  employed  and  the  test 
statistic  is  bootstrapped  in  order  to  estimate  critical  values  for  the  allocation  rule  in  such  a 
way  that  the  error  rate  is  controlled.  In  Miller,  Gray,  and  Woodward  (1993) 
(subsequently  abbreviated  (MGW)),  a  similar  hypothesis-testing  approach  is  used  for 
discriminant  analysis  and  outlier  detection  in  the  presence  of  missing  data.  The  EM 
algorithm  (Dempster,  Laird,  and  Rubin  (1977))  is  employed  to  obtain  maximum 
likelihood  estimates  of  model  parameters  and  compute  the  maximized  likelihoods  based 
on  the  available  data.  That  paper,  however,  only  considers  the  case  in  which  all  variables 

are  continuous  and,  in  fact,  normally  distributed. 

In  this  report,  we  wish  to  consider  the  remaining  case  in  which  we  have  a  mixture 
of  continuous  and  categorical  variables  used  as  discriminants,  and  also  missing  data, 
potentially  in  both  the  training  sets  and  in  the  new  observation  to  be  classified.  Once 
again,  we  use  a  hypothesis-testing  approach  to  classification  and  bootstrap  the  test 
statistic  in  order  to  control  the  probability  of  a  particular  type  of  misclassification.  We 
present  three  algorithms  for  handling  this  situation: 

(1)  The  INDICATOR  algorithm  -  This  algorithm  begins  by  converting  each  categorical 
variable  with  j  categories  into  j  -  1  indicator  variables.  This  results  is  a  larger 
number  of  variables  (unless  all  categorical  variables  are  already  binary,  in  which 
case  the  data  set  is  unchanged).  These  indicator  variables  can  be  analyzed  using 
techniques  for  quantitative  data.  In  this  algorithm  we  make  the  (obviously 
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incorrect)  assumption  that  all  variates  are  continuous  and,  in  fact,  normally 
distributed.  We  then  perform  discriminant  analysis  using  the  transformed  data  and 
the  techniques  of  MGW. 

(2)  The  FULL  algorithm  -  Next,  we  model  the  joint  distribution  of  each  observation  in 
the  following  manner:  Suppose  each  observation  consists  of  p  categorical  variables 
and  q  continuous  variables.  The  categorical  variables  define  r  cells  of  a  contingency 
table  in  which  the  observation  could  fall,  where  r  is  the  product  of  the  number  of 
categories  possible  within  each  categorical  variable.  We  assume  that  the 
observation  will  fall  into  cell  i  (i  =  1,  ...,  r)  with  probability  pj,  and  that  the 
conditional  distribution  of  the  continuous  part  given  that  the  discrete  part  places  the 
observation  into  cell  i  is  multivariate  normal  with  mean  pj  and  Sj.  We  then  employ 
the  EM  algorithm  to  obtain  maximum  likelihood  estimates  of  parameters  in  this 
model  and  compute  maximized  likelihoods  of  the  available  data,  and  bootstrap  the 
ratio  of  maximized  likelihoods,  as  was  done  in  BGWMF. 

(3)  The  COMMON  algorithm  -  This  algorithm  is  essentially  the  same  as  the  FULL 
algorithm,  except  that  we  assume  a  common  covariance  matrix  across  all 
multinomial  cells.  That  is,  the  conditional  distribution  of  the  continuous  part  given 
that  the  discrete  part  places  the  observation  into  cell  i  is  assumed  multivariate 
normal  with  mean  pj  and  E,  with  E  no  longer  depending  on  i.  This  reduces 
considerably  the  number  of  parameters  that  need  to  be  estimated  and  makes 
possible  calculation  of  the  likelihood  ratio  statistic  when  some  cells  may  be  sparsely 
represented,  or  not  represented  at  all. 

Simulation  studies  are  conducted  to  compare  and  contrast  the  performance  of  each 
of  these  procedures  with  regard  to  their  ability  to  accurately  control  the  Type  I  error  rate, 
and  with  regard  to  their  power. 
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9  Notatinn  and  Overview  of  the  Generalized  Likelihood  Ratio  Test  Procedure 

Suppose  we  wish  to  classify  a  (p+q)-dimensional  random  vector  V  into  one  of  two 
populations  tij  or  712-  Suppose  further  that  V  can  he  partitioned  as  V  =  (X,  Y),  where  X  = 
(X|,  X2,  Xp)  is  a  p-dimensional  vector  of  categorical  variables  and  Y  =  (Yj,  Y2, 

Yq)  is  a  q-dimensional  vector  of  continuous  variables.  Suppose  that  for  i  =  1,  p,  the 
variable  Xj  takes  on  one  of  the  rj  possible  values  1,  2, q.  Then  the  vector  X  takes  on 
one  of  r  =  nf^^rj  possible  values.  We  let  ^  denote  the  set  of  all  possible  values  of  the 
vector  X.  Finally,  suppose  that  training  samples  i  =  1, ...,  Nj  from  ttj  and  {v| 

i  =  1, ...,  N2  from  1x2,  each  having  the  same  structure  as  V,  are  available,  and  that  data 
may  be  missing  at  random  from  any  part  of  V  or  from  the  training  samples. 

The  generalized  likelihood  ratio  test  (GLRT)  procedure  for  classifying  V  into  Uj 
or  712  based  on  a  hypothesis  testing  approach.  That  is,  the  classification  of  V  is  done  by 
testing 


The  two  misclassification  probabilities  that  we  will  be  interested  in  are  P(2|l)  and  P(112), 
where  P(ilj)  denotes  the  probability  of  classifying  V  into  TTj  when  in  fact  V  €  txj.  We  will 
refer  to  a  =  P(21 1)  as  the  significance  level  for  the  test  and  P(2|2)  as  the  power. 

Let  m  denote  the  number  of  elements  in  V  that  are  missing  and  let  V(2)  =  (X(2)5 
Y(2))denote  the  (p  -  m)-variate  vector  of  available  data  in  V.  Similarly,  let  m^  denote  the 
number  of  elements  missing  from  and  let  denote  the  (p  -  mP)-variate  vector  of 

available  data  in  (j  =  1,  2;  i  =  1,  2,  ...,  Nj).  We  assume  that  has  joint  density 
function  f(V10(^))  and  that  7^2  has  joint  density  function  f(V19<^2)),  where  f  is  some 
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parametric  density  function  with  parameters  and  G^^^  for  populations  and  ^2’ 
respectively.  Then,  under  Hg,  the  likelihood  of  V  and  the  training  samples  is  given  by 


Lo,(0(»|  V,  v*'),  V”.  -  .v«J)Lo2(era|  vf ,  xfK ...  .vg’), 


where 


(2) 


Ni 


V,  V^2  \  ...  =  f2(VlG(l))nfii(VPlG0)), 


N2 


Lo2(®^^^1  vf  \  vf , ...  ,vf ) = nf2i(vf  ^|g(2)), 

2  i  1 


f2(V|G(^))  is  the  marginal  density  function  for  V(2)  evaluated  at  V(2)  with  parameters  G(^\ 
and  fjj(VPlGO))  is  the  marginal  density  function  for  evaluated  at  with 

parameters  G<JX  Under  Hj,  the  likelihood  of  V  and  the  training  samples  is  given  by 


L„(6(»|  V<'>,  V<‘), ...  .V<")L,2(9m|  V.  V' 


(2)  v(2) 

1  ,  ’*'2  > 


V(2)) 


where 


(3) 


Ni 


L,  ,(e(')|  v<'\  vf , ...  .Vf )  =  n  f„(vf  ie(»), 


and  f2(V|G(2))  ig  the  marginal  density  function  for  V(2)  evaluated  at  V^2)  with  parameters 
G(2).  We  emphasize  that  these  are  the  likelihood  functions  for  the  available  data  rather 
than  the  likelihood  functions  for  the  complete  data  since  f2  and  fjj  (j  =  1,  2;  i  =  1,  2, ..., 
Nj)  are  marginal  densities  for  the  available  part  of  each  observation,  rather  than  the 
likelihood  functions  for  the  complete  data. 
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The  GLRT  procediire  is  based  on  the  ratio 


LR  = 


,eCl  *•  -  . 

(9(1)>))'-I  ''2 ’•  -  ''f ’• ''®  - 


(4) 


Loi(^0  *1 V.  ''?■  vf . vjlSzffll  vf . ''n!> 

L„(6™|  V<'>.  V® v|]|)L,2(ef  >1 V,  vf .  vf , ...  .vg')’ 


where  6®  and  0®  are  maximum  likelihood  estimates  of  0^^  (j  =  1,  2)  under  the  null  and 
alternative  hypotheses,  respectively.  That  is,  0q  ^  is  the  MLE  of  0^^^  based  on  the  sample 
{V,  ...  ,  0Q^^  is  the  MLE  of  0^2^  based  on  the  sample  ...  , 

vjh,  and  0^^^  is  the  MLE  of  0(^)  based  on  the  sample  {V^^\  ... ,  and  0^^  is 

1  1  .i  IN}  1 

the  MLE  of  0^^^  based  on  the  sample  {V,  ... , 

Equivalently,  the  test  procedure  may  be  based  on  the  statistic 


X  -  log(LR)  -  Lqj  +  Xq2  ■  ^11  ■  ^12’ 


(5) 


where 

X  0,  =  logfjCVie™)  +  Siogf,i(vP'|9''>).  (6) 

1=1 

Xii-EiogfiiCvl'Wi'^iuid 

x,2  =  logfjcvief )  +  E  iogf2i(v<^>|ef  >). 

A  key  step  in  evaluating  X  for  a  given  sample  is  the  computation  of  the  maximum 
likelihood  estimates  and  the  corresponding  maximized  log-likelihood  functions  Xqj,  Xq2, 
Xu,  and  X^  in  Equation  (6).  This  is  no  trouble  when  the  data  are  complete,  as  illustrated 


43 


by  (BGWMF).  However,  in  the  presence  of  missing  data,  the  usual  expressions  for 
maximum  likelihood  estimates  are  no  longer  valid.  In  this  case,  maximum  likelihood 
estimates  are  obtained  via  the  EM  algorithm  (Dempster,  Laird,  and  Rubin  (1977)).  The 
EM  algorithm  is  an  iterative  procedure  for  obtaining  parameter  estimates  which 
maximize  the  likelihood  function  of  the  available  data.  It  involves  two  key  steps: 

(E  -step)  -  Using  current  estimates  0^^^  (where  k  now  denotes  the  current  iteration  step, 
rather  than  designating  tij  or  7:2),  estimate  the  values  of  the  complete  data 
sufficient  statistics  by  computing  their  expectations  given  the  available  data. 
(M-step)  -  Determine  the  values  of  the  parameters  which  maximize  the  likelihood  for  the 
complete  data  based  on  the  current  estimates  of  the  complete  data  sufficient 

A/k-j-n 

statistics,  thus  yielding  0^ 

A/k') 

The  EM  algorithm  iteratively  performs  E-  and  M-steps  until  the  sequence  {0  } 

converges  to  an  adequate  approximation  to  the  MLE.  To  evaluate  the  test  statistic  X  of 

^(1) 

Equation  (5),  we  must  implement  the  EM  algorithm  four  times.  That  is,  0^  and  are 
based  on  the  sample  {V,  ... ,  V^|},  0^  and  ^02  are  based  on  the  sample  (vf  \ 

... ,  0^^^  and  X,,  j  are  based  on  the  sample  ... ,  V^^},  and  0^"^  and 

7-12  are  based  on  the  sample  {V,  ... , 

The  decision  rule  is  described  as  follows:  small  values  of  X  provide  evidence  in 
favor  of  Hj,  hence,  V  is  classified  into  712  if  ^  ^  X^,  otherwise  v  is  classified  into  Ttj.  The 
cut-off  value  X^  is  chosen  so  that  P(2|l)  =  a,  the  desired  significance  level  for  the  test. 
Since  the  null  distribution  of  X,  is  not  known,  the  critical  value  is  approximated  by  the 
parametric  bootstrap  (Efron  1979).  For  some  large  integer  B,  B  bootstrap  samples  {V  , 
y*(i)  are  simulated  from  a  distribution  with  density  f(V10^^^)  and  B 

bootstrap  samples  ... ,  are  simulated  from  a  distribution  with  density 

f(V|0*-^^),  where  0^^^  and  0^^^  are  MLEs  obtained  from  the  samples  {Vj^\  \  ... , 
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and  ... ,  V^^},  respectively.  (Notice  that  in  this  case,  0^^^  =  0j^^  and  - 

0^^^.)  When  there  are  missing  values,  the  simulated  bootstrap  samples  should  also  have 

missing  values  in  a  configuration  similar  to  that  in  the  actual  data.  For  each  bootstrap 

♦  *  * 

sample,  the  test  statistic  X  is  computed,  thus  generating  a  random  sample  {>.j,  X^, ... , 

of  variates  that  have  approximately  the  same  distribution  as  X  under  Hq.  For  an  a-level 

^  *  *  * 
test,  the  cut-off  value  X^  is  chosen  as  the  a-th  empirical  quantile  of  X^ . X.g}. 

*  * 

Finally,  V  is  classified  into  tIj  if  X  >  V  is  classified  into  712  if  X.  ^  X.^- 

As  was  pointed  out  in  (MGW),  this  test  procedure  is  only  an  approximation  to  the 
true  GLRT  procedure  since  the  critical  value  is  obtained  via  bootstrapping  and  we  may 
further  relax  our  approximation  to  the  true  GLRT  procedure  by  relaxing  the  number  of 
iterations  performed  by  the  EM  algorithm.  That  is,  we  may  choose  a  stopping  criterion 
for  the  EM  algorithm  that  does  not  continue  iteration  until  convergence  has  been  obtained 
to  a  high  degree  of  accuracy.  Whatever  the  stopping  criterion,  bootstrapping  the  test 
statistic  insures  an  approximate  a-level  test.  As  in  (MGW),  it  would  appear  that  very 
little  power  is  lost  by  only  performing  a  very  few  iterations  of  the  EM  algorithm,  as 
opposed  to  carrying  out  iterations  until  MLEs  are  obtained  with  a  high  degree  of 
accuracy.  In  Section  6,  we  often  use  only  three  iterations  as  standard  practice  in  our 
simulation  studies. 

Our  implementation  of  the  GLRT  procedure  for  discriminant  analysis  is 
summarized  in  Figure  1.  Figures  2,  3,  and  4  further  describe  the  bootstrapping  module, 
the  computation  of  the  test  statistic  X-,  and  the  EM  algorithm  for  obtaining  MLEs.  Each 
of  the  various  algorithms  discussed  in  this  paper  share  this  common  skeletal  structure. 
The  differences  lie  in  the  type  of  model  being  assumed  for  the  data,  the  corresponding 
implementation  of  the  EM  algorithm  for  obtaining  MLEs,  and  the  precise  formulas  used 
to  evaluate  the  maximized  log-likelihood  functions. 
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■3.  The  INDICATOR  Algorithm 

In  our  first  attempt  to  implement  an  algorithm  for  discriminant  analysis  for  mixed 
categorical  and  continuous  variables  with  missing  data,  we  desired  to  use  the  methods 
presented  in  MGW  with  as  little  adaptation  as  possible.  One  way  to  do  this  would  be  to 
treat  the  categorical  variables  as  if  they  were  continuous  and  use  the  procedures  of  MGW 
without  any  alteration  at  all.  This  is  perhaps  not  such  a  bad  idea  if  categorical  variables 
have  a  large  number  of  categories,  if  these  categories  have  a  natural  ordering,  and  if  the 
distribution  of  this  variable  has  a  somewhat  normal  shape.  In  most  cases,  however,  these 
conditions  are  not  satisfied  and  the  procedure  would  be  totally  inappropriate. 

A  modification  to  the  above  approach  is  to  replace  each  categorical  variable  with 
indicator  variables  in  the  following  manner:  We  replace  each  categorical  variable  Xj 
(i  =  1, ... ,  p)  from  V  with  the  rj  - 1  indicator  variables 


Wy  I(Xj  j)  |q 


a=l,2,...,ri  -1). 


(7) 


Hence,  the  vector  X  of  categorical  variables  gets  replaced  by  a  vector  W  of  binary 
variables  of  length  -  p,  producing  the  transformed  vector  V  =  (W,  Y).  If  Xj  is 
missing  in  X,  then  each  Wy  (j  =  1, 2, ... ,  r^  - 1)  is  missing  in  W.  We  transform  the 
training  samples  Vp^  G  =  U  2;  i  =  1, 2, ... ,  Nj)  in  a  similar  manner  producing  V®  G  = 
2;i=l,2,...,Nj). 

Now,  having  transformed  each  observation,  we  classify  V  by  classifying  V 
according  to  the  GLRT  procedure  as  outlined  in  (MGW)  for  the  continuous-variables- 
only  case  with  missing  data  using  the  transformed  data  V  and  G  “  2;  i  =  1,  2, ..., 

Nj).  That  is,  we  proceed  as  if  V  and  G  =  1.  2;  i  =  1,  2,  ...,  Nj)  were  normally 
distributed,  ignoring  the  fact  that  many  of  the  components  are  binary.  In  simulation 
studies  (see  Section  6,  below),  we  see  that  this  method  actually  performs  about  as  well  as 
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methods  based  on.  a  more  plausible  model  for  the  categorical  variables,  and  is  much 
easier  to  implement. 

4  The  FUTJ.  Algorithm 

Next,  we  derive  the  GLRT  procedure  using  a  more  plausible  model  for  the 
distribution  of  V.  In  this  case,  we  assume  that  the  distribution  of  X  follows  a  multinomial 
distribution  in  the  sense  that  Pr[X  =  x]  =  p^  for  each  x  e  T,  and  the  conditional 
distribution  of  Y  given  X  =  x  is  multivariate  normal  with  mean  and  covariance  matrix 
and  Hence,  0  =  {p^,  x  e  'F}  and 

f(v|e)  =  PxMVN(y|Px,5:^),  (8) 

where  MYNCylp^^  ^x)  <ienotes  the  value  of  the  multivariate  normal  density  function  with 
parameters  and  evaluated  at  y. 

The  first  step  in  deriving  the  GLRT  procedure  is  to  develop  the  EM  algorithm  for 
obtaining  MLEs  of  0  given  a  collection  of  observations  (V^,  V2,  ...,  Vj^)  with  missing 
values  from  such  a  population.  The  vectors  Vj  may  be  partitioned  as  (Xj,  Yj),  where  Xj 
and  Yj  are  the  vectors  of  categorical  variables,  and  continuous  variables  respectively,  and 
further  partitioned  as  (X^,  X2i,  Y^,  Y2i),  where  X^j  and  Y^  correspond  to  missing 
observations,  and  X2i  and  Y2i  correspond  to  available  observations.  (In  this  final 
partitioning,  the  dimensions  of  the  various  pieces  may  vary  with  i,  and  elements  may  be 
permuted  differently  for  each  i  according  to  the  pattern  of  missing  values  in  each 
observation.)  We  note  that  the  complete-data  sufficient  statistics  for  the  parameters  in 

this  model  are 
N,  =  I".,I(Xi  =  l), 

S,  =  =  x)Yi,  and  (i  s  'P)  (9) 

SS,  =  S;'.,I(Xi  =  s)YiY[, 
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and  that  the  MLEs  for  the  parameters  in  this  model  based  on  the  complete  data  are 


Px  Njj/n, 

Px  =  S^/Nj,  and  (x  e  'F)  (10) 

A  A  Aj 

Sjj.  —  SSjj./Njj  -  Pjj  . 


The  M-step  in  this  setting  simply  amounts  to  evaluating  each  of  the  pieces  of  (10). 
The  main  computational  burden  lies  in  computing  the  conditional  expectations  of  the 
complete  data  sufficient  statistics  given  the  available  data  under  current  parameter 
estimates  in  each  iteration  (the  E-step). 

^(k) 

In  the  E-step,  we  wish  to  compute  (under  the  distribution  defined  by  0'^ 

E[N^  I  {(X2i,  Y2i),  i  =  1,  n}]  =  S"=iE[I(Xi  =  x)  |  (X2i,  ¥2^)], 

E[S^  i  {(X2i,  Y2i),  i  =  1,  n}]  =  2:!LiE[I(Xi  =  x)Yi  |  (X2i,  ¥2;)],  and  (x  e  'F)  (11) 

E[SS^  I  {(X2i,  Y2i),  i  =  1,  n}]  =  l"=iE[I(Xi  =  x)YiYj^  |  (X2i,  ¥24)]. 

This  computation  is  facilitated  by  the  following  identities; 


E[I(X  =  x)h(Y)  I  (X2,  ¥2)]  =  Pr[X  =  x  j  (X2,  ¥2)]  •  E[h(Y)  |  X  =  x,  ¥2] 


Pr[X  =  xl(X2,Y2)]  = 


l(X,  =  x,)p,MVN(y^|^.,  Z,) 

E  I(X2  =  X9)pxMVN(y9 1  %) 

X€'P 


0)^v(12)rv(22X-E 


E[Yi  I  X  =  X,  ¥2]  =  (Y2 


(12) 

(13) 

(14) 


E[Y2lX  =  x,Y2]  =  Y2  (15) 

E[YiYj^lX  =  x,Y2]=  (16) 

^(1 1)_  ^(12)^^(22)^-1^(21)  +  e[Yi|  X  =  X,  Y2]E[YiI  X  =  X,  ¥2]'^ 

E[Yi ¥2  I X  =  X,  ¥2]  =  E[Yi  I X  =  X,  ¥2]  •  yJ 

48 


(17) 


E[Y2Y^|X  =  x,Y2]=Y2Y^ 


(18) 


Here,  \xf,  and  are  appropriate  partitions  of  and 

corresponding  to  the  missing  and  available  parts  of  Y.  Equations  (14)  -  (18)  are  used  to 
estimate  the  missing  parts  of  E[Y  |  X  =  x,  Y2]  and  E[YY'*'  1  X  =  x,  Y2].  Computation  of 
the  expectations  in  (1 1)  is  then  carried  out  as  follows:  For  each  observation  in  the  data 
set,  compute  E[I(Xj  =  x)  1  (X2i,  Y2i)],  E[I(Xj  =  x)Yj  |  (X2i,  Y2j)],  and  E[I(Xj  -  x)YjYj  \ 
(X^j,  Y2i)]  for  each  x  e  T  via  Equations  (12)  -  (18).  Accumulate  these  over  all 

observations  to  obtain  (1 1). 

In  the  case  of  continuous  variables  only,  (MOW)  used  estimates  of  parameters 
based  on  substituting  means  for  missing  observations  as  initial  estimates  in  the  iterative 
process.  This  becomes  more  complicated  in  the  presence  of  categorical  variables.  To 
simplify  initialization,  we  use  “blind  initialization”:  we  initialize  each  p^  with  1/r,  each 
with  0,  and  each  with  I.  Experience  so  far  indicates  that  the  first  iteration  of  the 
EM  algorithm  substantially  alters  the  parameter  estimates  to  something  comparable  to 
“mean  substitution,”  if  that  means  anything  in  this  context.  In  any  case,  this  initialization 
procedure  has  worked  adequately  so  far  in  simulation  studies. 

Having  evaluated  the  MLEs  using  the  EM  algorithm,  we  need  a  method  for 
evaluating  the  maximized  log-likelihood  functions  in  Equation  (6).  The  likelihood 
function  for  the  available  data  is  the  product  of  the  likelihoods  of  the  available  parts  of 
each  observation.  The  likelihood  of  the  available  part  of  a  single  observation  is  the 
marginal  density  for  (X2,  Y2).  This  may  be  obtained  from  the  density  for  (X,  Y)  by 
integrating  out  Xj  and  Y  j .  This  gives 

fxoYo(W2)=  ^  I(x2  =  X2)PxMVN(y2|Px,%)-  (19) 
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The  maximized  log-likelihood  of  the  available  data  in  a  sample  is  obtained  by 
accumulating  the  values  of  the  log  of  (19)  over  all  observations  in  the  sample.  Thus,  we 
may  evaluate  each  of  the  pieces  Xq2,  and  X12  in  Equation  (6),  from  which  we 
may  evaluate  the  test  statistic  X  =  log(LR)  given  in  (5). 

As  will  be  seen  in  the  simulation  results  of  Section  6,  the  FULL  Algorithm  has 
one  major  flaw  that  must  be  addressed.  That  is,  it  can  only  be  used  in  cases  in  which 
there  is  adequate  representation  in  each  cell  to  obtain  a  full-rank  estimate  of  for  each  x 
e  'F  in  both  populations.  If  r  is  large,  i.e.,  if  there  are  a  large  number  of  categorical 
variables,  or  a  large  number  of  categories  within  some  categorical  variables,  or  both,  then 
the  training  samples  may  need  to  be  extremely  large  so  that  all  parameters  can  be 
estimated  accurately.  In  practice,  such  large  samples  may  not  be  available,  and  it 
becomes  necessary  to  impose  further  constraints  on  the  parameters  of  the  model  so  that 
the  number  of  parameters  required  is  reduced.  This  leads  us  into  our  discussion  of  the 
next  algorithm. 

5.  The  COMMON  Algorithm 

This  last  algorithm  is  very  similar  to  the  FULL  Algorithm  except  that  in  our 

model  for  the  data,  we  assume  that  the  conditional  covariance  matrix  for  the  continuous 

part  given  the  discrete  part  is  common  for  all  x  €  'P.  That  is,  the  conditional  distribution 

of  Y  given  X  =  x  is  multivariate  normal  with  mean  covariance  matrix  and  2  not 

depending  on  x.  Hence,  0  =  {p^,  p^,  2;  x  s  'F} .  This  reduces  the  number  of  parameters 

that  need  to  be  estimated  considerably,  and  makes  parameter  estimation  possible  when 

some  cells  are  sparse,  or  not  represented  at  all.  We  allow  the  possibility  of  different 

parameters  for  each  of  the  two  populations,  but  within  each  population,  2  is  common 

across  all  multinomial  cells.  This  model  gives  precisely  the  general  location  model  of 

Olkin  and  Tate  (1961).  The  EM  algorithm  for  this  model  is  developed  by  Little  and 

Schluchter  (1985),  and  they  point  out  that  this  can  be  used  to  implement  the  GLRT 
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procedure  proposed  by  Krzanowski  (1982).  What  follows  is  precisely  this  procedure, 
with  the  added  feature  that  we  bootstrap  the  distribution  of  the  test  statistic  in  order  to 
choose  critical  values  to  control  the  P(2|l)  error  rate.  Although  Little  and  Schluchter 
(1985)  describe  the  EM  algorithm  for  this  model  in  considerable  detail,  we  present  a 
description  of  the  algorithm  here  that  is  consistent  with  the  notation  of  Section  4. 

First,  we  observe  that  the  complete-data  sufficient  statistics  for  the  parameters  in 
this  model  are 


=  s;LiI(Xi  =  X), 

Sx  =  and  (X  e  'F)  (20) 

SS  =  s”.,YiYj, 

and  that  the  MLEs  for  the  parameters  in  this  model  based  on  the  complete  data  are 
Px  ~  Njj/n, 

Px  =  Sx/Nx,and  (x  e  T),  (21) 

A  A  A 

S  —  2^xe'i'Px^x’ 

where  is  given  by  equations  (9)  and  (10).  In  other  words,  the  MLE  of  E  in  this  case  is 
precisely  a  weighted  average  of  MLEs  of  for  each  x  based  on  the  FULL  model  with 
weights  Px-  Hence,  we  may  perform  the  M-step  in  this  algorithm  with  exactly  the  same 

A 

formulas  as  the  M-step  in  the  FULL  Algorithm,  except  that  after  each  E^,  is  computed,  we 
average  these  according  to  (21)  to  obtain  the  updated  estimate  of  the  common  E.  The 
E-step  for  this  algorithm  is  also  identical  to  the  E-step  in  the  FULL  Algorithm,  except 

A  ^ 

that  throughout  formulas  (12)  -  (18),  each  E^  is  replaced  by  the  common  E.  The 
evaluation  of  the  maximized  log-likelihood  functions  in  (6)  is  also  performed  using  (19), 

A 

as  in  the  FULL  algorithm,  with  again,  the  only  difference  being  that  each  E,^  is  replaced 

A 

by  the  common  E. 
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6.  Simulation  Results 

We  have  performed  simulations  of  each  of  the  three  algorithms  (INDICATOR, 
FULL,  and  COMMON)  based  on  several  different  parameter  configurations  in  order  to 
determine  how  well  the  algorithm  controls  the  Type  I  misclassification  probability  as 
desired,  and  to  assess  the  power  P(2|2)  of  each  algorithm.  We  also  keep  track  of  how 
many  times  the  algorithm  fails  to  classify  the  observation  at  all.  These  failures  occm 
when  for  some  reason  the  simulated  data  fails  to  yield  full-rank  estimates  of  all  required 
covariance  matrix  parameters.  This  results  in  an  undefined  test  statistic  X.  This  happens 
most  frequently  in  the  FULL  algorithm,  and  is  caused  by  a  very  few  number  of 
observations  falling  into  one  or  more  of  the  multinomial  cells.  It  happens  occasionally  in 
the  INDICATOR  algorithm  when  at  least  one  possible  value  of  a  categorical  variable  is 
not  represented.  Failures  may  occur  when  the  test  statistic  is  undefined  for  the  sample 
which  we  are  trying  to  classify,  and  also  when  the  statistic  is  undefined  for  attempted 
bootstrap  samples.  We  see  in  our  simulations  that  the  COMMON  algorithm  is  least 
susceptible  to  these  types  of  failures. 

Our  first  simulation  involved  the  same  parameter  configurations  used  in  BGWMF 

(Case  3 :  Mixture  of  Categorical  and  Continuous  V ariables).  That  is,  we  consider  the  case 

in  which  the  categorical  part  is  a  single  Bernoulli  variable  and  the  continuous  part  is  a 

single  normal  random  variable  independent  of  the  categorical  variable.  For  population 

TTj,  the  Bernoulli  parameter  is  pj  =  0.1.  The  mean  and  variance  of  the  continuous 

variable  are  =  0  and  Gj  =  0.5,  respectively.  For  population  112,  we  use  P2  =  0.9,  0.7, 

and  0.5,  =  1-0,  and  IJ.2  =  0.5  +  Acr^  where  A  takes  on  values  0,  1,  2,  and  3.  The 

observed  significance  level  P(2|l)  is  the  proportion  of  times  out  of  500  simulated  trials  in 

which  the  variable  V  is  classified  into  7t2  when,  in  fact,  it  was  simulated  from  tcj.  The 

estimated  power  P(212)  is  the  proportion  of  times  out  of  500  simulated  trials  in  which  the 

variable  V  is  classified  into  112  when,  in  fact,  it  was  simulated  from  712-  In  order  to 

achieve  an  approximate  significance  level  of  ct  =  0.05,  the  variable  V  was  classified  into 
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3jC  ^ 

712  if*®  statistic  X  is  less  than  or  equal  to  ,  the  0.05-th  empirical  quantile  of  {Xj, 
*  * 

^2’  "■  ’ 

For  Nj  =  N2  =  50  and  B  =  99,  the  power  estimates  are  plotted  in  Figure  5,  based 
on  simulations  with  no  missing  data.  We  see  that  the  FULL  and  COMMON  algorithms 
agree  very  well  with  the  power  curves  plotted  in  BGWMF  (Figure  2),  In  fact,  with  no 
missing  data,  the  COMMON  algorithm  is  essentially  equivalent  to  the  method  of 
BGWMF,  so  these  simulation  results  should  agree  very  well,  as  they  do.  The 
INDICATOR  algorithm  does  not  agree  well  with  the  FULL  and  COMMON  algorithms. 
For  this  reason,  the  points  corresponding  to  the  INDICATOR  algorithm  are  not  connected 
with  lines,  since  this  would  clutter  the  plot.  It  would  seem  that  the  INDICATOR 
algorithm  has  higher  power  in  general  than  the  other  two.  This  is  surprising  since  this 
algorithm  does  not  model  well  the  true  distribution  of  the  binary  variable.  A  closer 
examination  of  the  simulation  results  shows  that  this  is,  in  fact,  misleading,  since  the 
INDICATOR  tends  to  yield  a  significance  level  nearly  twice  the  desired  0.05  level.  This 
can  be  seen  in  Figure  6,  which  shows  the  power  estimate  plotted  versus  the  observed 
significance  level.  Each  plot  in  Figure  6  corresponds  to  a  specific  value  of  A.  We  can 
also  see  in  Figure  6  that  the  COMMON  algorithm  most  accurately  achieves  the  desired  a 
=  0.05  significance  level. 

In  Figures  7  and  8,  we  show  corresponding  plots  based  on  data  with  missing 
values.  In  these  simulations,  each  variable  in  each  observation  was  deleted  independently 
with  probability  0.1,  so  that  roughly  10%  of  the  data  is  missing.  We  see  an  overall 
decrease  in  the  power  of  all  three  algorithms  compared  to  the  full-data  case,  but  this  is  to 
be  expected  since  the  test  is  based  on  less  available  data.  Otherwise,  the  results  of  the 
missing-data  case  are  comparable  to  the  results  of  the  full-data  case. 

We  have  tabulated  the  results  of  this  simulation  in  Table  1.  The  ERROR  column 

shows  the  percentage  of  times  out  of  the  500  simulations  that  the  algorithm  failed  to 

classify  V  due  to  singular  parameter  estimates.  We  see  that  the  FULL  algorithm  is  most 
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PROGRAM  CODE:  I  -  INDICATOR  F  -  FULL  C  -  COMMON 

Table  1 .  Listing  of  results  from  first  simulation. 


susceptible  to  failures  of  this  sort,  failing  as  much  as  6  to  7%  of  the  time  when  P2  -  0.9 
and  no  data  is  missing.  The  INDICATOR  algorithm  failed  somewhat  less  frequently,  and 
the  COMMON  algorithm  never  failed  in  this  study. 

In  our  next  simulation  study,  we  consider  the  case  in  which  we  have  two 
categorical  variables,  each  with  two  categories,  and  two  continuous  variables.  For 
population  tij,  each  possible  combination  of  the  categorical  part  (X  —  (1,1),  (1,2),  (2,1) 
and  (2,2))  occurs  with  probability  1/4.  The  conditional  distribution  of  the  continuous  part 

is  MVN(0,  Sj),  where 


within  each  multinomial  cell  {is.,  conditional  on  each  possible  value  of  the  discrete  part). 
For  population  conditional  covariance  matrix  for  the  continuous  part  is  I.2,  where 

22  is  given  by 


^2 


1  -0.5 

-0.5  1  . 


(23) 


We  use  three  different  probability  distributions  for  the  discrete  part,  and  four  different 
configurations  of  mean  vectors  for  the  conditional  distributions  of  the  continuous  part 
given  each  possible  discrete  part.  In  the  plots  and  tables  which  follow,  the  three 
probability  distributions  are  coded  with  the  variable  PCODE,  which  takes  on  values  1,  2, 
and  3.  The  four  mean  vector  configurations  are  coded  with  the  variable  MCODE,  which 
takes  on  values  1,  2,  3,  and  4.  The  parameter  configurations  defined  by  these  codes  are 
shown  in  Table  2. 
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PCODE 

Pr[X  =  (l,l)] 

Pr[X  =  (l,2)] 

Pr[X  =  (2, 1)] 

Pr[X  =  (2,2)] 

1 

0.25 

0.25 

0.25 

0.25 

2 

0.50 

0.20 

0.20 

0.10 

3 

0.80 

0.10 

0.10 

0.00 

MCODE 

E[Y|X  =  (1,1)] 

E[Y|X  =  (1,2)1 

EfYIX  =  (2,1)1 

E[Y|X  =  (2,  2)] 

1 

2 

3 

4 

(2,2) 

(2,2) 

(2,2) 

(2,2) 

Table  2.  Definitions  for  parameter  codes  used  in  out  second  simulation  study. 


PCODE  =  1  corresponds  to  a  uniform  distribution  across  all  multinomial  cells.  PCODE 
=  2  and  PCODE  =  3  correspond  to  distributions  increasingly  favoring  cell  (1,1). 
MCODE  =  1  corresponds  to  a  mean  configuration  identical  to  that  for  population  ttj. 
MCODE  =  2  and  MCODE  =  3  correspond  to  changes  in  mean  for  certain  cells,  and 
MCODE  =  4  corresponds  to  the  sum  of  these  two  changes.  For  PCODE  =  1  and 
MCODE  =  1,  population  is  identical  to  population  Uj  except  for  the  correlation 
between  the  two  continuous  variables. 

As  in  our  first  study,  we  take  N|  =  N2  =  50,  B  =  99,  a  =  0.05,  and  base  our 
observed  significance  level  and  power  estimates  on  500  replications  of  the  procedure  in 
each  case.  Figure  9  shows  the  power  estimates  plotted  versus  the  mean  configuration 
when  no  data  is  missing.  Figure  10  shows  plots  of  the  power  estimate  versus  the 
observed  significance  level  for  each  mean  configuration.  Figures  11  and  12  are 
corresponding  plots  for  approximately  10%  missing  data,  with  data  deleted  at  random  in 
the  same  manner  as  our  previous  study.  Table  3  shows  a  listing  of  these  results, 
including  the  percentages  of  failures  due  to  singular  parameter  estimates. 

In  Figure  9,  we  see  the  power  increases  in  general  as  the  separation  between  in 

means  increases  (i.e.,  as  MCODE  changes  from  1  to  4).  MCODE  =  2  and  MCODE  =  3 

actually  correspond  to  the  same  degree  of  difference  in  means,  so  the  power  for  these  are 
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PROGRAM  CODE:  I  -  INDICATOR  F  -  FULL  C  -  COMMON 

Table  3.  Listing  of  results  from  second  simulation  study. 


not  expected  to  be  too  different.  In  fact,  the  power  estimates  for  MCODE  =  2  and 
MCODE  =  3  are  very  similar  when  PCODE  =  1.  However,  they  are  not  very  similar 
when  PCODE  =  2  or  3.  In  this  case,  power  is  lower  for  MCODE  =  3  than  for  MCODE  = 
2.  This  results  since  for  MCODE  =  3,  the  means  differ  in  sparse  cells,  whereas  for 
MCODE  =  2,  the  means  differ  most  in  the  most  common  cell  (corresponding  to  X  = 
(1,1)),  making  it  more  easy  to  differentiate  between  the  two  populations.  We  see  similar 
patterns  in  Figure  1 1,  and  can  also  see  a  general  decrease  in  power  due  to  missing  values. 

These  plots  seem  to  indicate  that  the  INDICATOR  and  COMMON  algorithms 
have  very  similar  power,  these  being  generally  better  than  the  FULL  algorithm.  As  in  our 
first  study,  we  notice  in  Figures  10  and  12  that  the  INDICATOR  algorithm  has  a 
tendency  to  yield  a  higher  significance  level  than  desired,  especially  when  PCODE  =  3 
(z.e.,  when  some  cells  are  very  sparse). 

We  see  from  Table  3  that  the  INDICATOR  algorithm  fails  occasionally  due  to 
singular  covariance  matrix,  especially  when  PCODE  =  3.  The  FULL  algorithm  does 
much  worse  when  cells  are  sparse.  The  FULL  algorithm  fails  about  two-thirds  of  the 
time  when  PCODE  =  3  and  data  is  missing!  When  some  cells  occur  with  very  low 
probability,  it  is  necessary  to  have  very  large  samples  so  that  each  cell  is  represented 
enough  to  obtain  a  full-rank  estimate  of  the  covariance  matrix  within  that  cell.  Samples 
of  size  50,  cells  are  not  adequately  represented  about  2/3  of  the  time.  Once  again,  we  see 
that  the  COMMON  algorithm  is  least  susceptible  to  failures  due  to  singular  parameter 
estimates. 

Readers  may  wonder  why  the  algorithm  doesn't  foil  every  time  for  PCODE  =  3 

since  cell  (1,1)  is  never  represented.  If  a  cell  probability  is  estimated  to  be  zero,  the 

covariance  matrix  estimate  for  that  cell  is  never  used  in  the  computation  of  X,  and  can 

therefore  be  disregarded.  It  is  not  cell  (1,1)  that  is  the  problem  here,  rather  it  is  cells  (0,1) 

and  (1,0).  Readers  may  also  find  it  strange  that  for  PCODE  =  2,  there  are  fewer  failures 

in  the  FULL  algorithm  when  data  is  missing  than  when  all  data  is  available.  This  may  be 
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explained  intuitively  as  follows:  When  data  is  missing  in  the  discrete  part,  there  is  some 
possibility  that  the  observation  falls  into  any  of  a  number  of  cells.  This  observation 
contributes  to  the  parameter  estimates  for  all  cells  to  which  the  observation  might  truly 
belong,  resulting  in  fewer  rank  problems  in  sparse  cells. 

7.  Concluding  Remarks 

In  this  report,  we  have  extended  the  results  of  BGWMF  and  MGW  to  perform 
discriminant  analysis  with  categorical  and  continuous  variables  when  data  is  missing. 
We  presented  three  algorithms  for  doing  so.  In  simulation  studies,  we  have  observed  that 
the  INDICATOR  algorithm  has  a  tendency  to  yield  a  higher  Type  I  error  rate  than 
desired.  The  FULL  algorithm  often  fails  due  to  singular  parameter  estimates  when  some 
value  of  the  discrete  part  is  sparsely  represented.  The  COMMON  algorithm  seems  to 
avoid  these  problems,  and  is  therefore  the  preferred  algorithm,  especially  when  samples 
are  small  and  the  assumption  of  a  common  covariance  matrix  across  all  multinomial  cells 
is  reasonable.  The  code  has  now  been  transferred  to  MRC  and  Dr.  Mark  Fisk  is  applying 
these  techniques  to  some  existing  seismic  data. 
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Figure  1.  Flowchart  for  the  GLRT  procedure  for  discriminant  analysis 


Figure  2.  Flowchart  for  the  bootstrapping  portion  of  the  GLRT  procedure. 
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Figure  3.  Flowchart  for  computing  the  test  statistic  X  in  the  GLRT  procedure. 


Figure  4.  Flowchart  for  the  EM  algorithm. 
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Figure  5.  Power  estimates  when  no  data  is  missing  for  each  of  the  three  algorithms,  with 
data  modeled  as  a  Bernoulli  random  variable  and  an  independent  normal  random  variable. 

2 

Parameters  for  population  tIj  are  pj  =  0.1,  |ij  =  0,  and  Oj  =  0.5.  Power  estimates  are 

2 

based  on  the  following  configurations  for  population  712-  P2  “ 

and  |i2  ~  where  A  takes  on  values  0,  1,  2,  and  3.  For  each  value  of  A,  the 

symbols  I,  F,  and  L  are  plotted  at  the  corresponding  power. 
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Figure  6.  Plots  of  estimated  power  versus  observed  significance  for  each  value  of  A  when  no  data  is  missing.  Symbol  used  is  I  for 
INDICATOR  algorithm,  F  for  FULL,.and  C  for  COMMON.  The  solid  line  corresponds  to  pj  =  0.9,  the  dashed  line  to  P2  =  0.7,  and 
the  dotted  line  to  p2  =  0.5.  The  symbols  I,  F,  and  C  are  plotted  at  positions  associated  v^dth  the  ordered  pairs  (significance  level, 
power).  The  lines  are  not  indicative  of  significance  level  or  power  but  are  only  drawn  to  identify  the  ordered  pairs  associated  with  a 
particular  value  of  P2. 


I  -  INDICATOR 
F  -  FULL 
C  -  COMMON 


P2  =  0.9 
P2  =  0.7 
P2  =  0.5 


Figure  7.  Power  estimates  when  approximately  10%  of  the  data  is  missing  for  each  of  the 
three  algorithms,  with  data  modeled  as  a  Bernoulli  random  variable  and  an  independent 

2 

normal  random  variable.  Parameters  for  population  are  p^  =  0.1,  |Xj  =  0,  and  =  0.5. 

Power  estimates  are  based  on  the  following  configurations  for  population  %2-  P2  ~ 

2  2 

0.7,  and  0.5,  02  =  1.0,  and  |i2  “  0.5  +  Aa2,  where  A  takes  on  values  0,  1,  2,  and  3. 
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Figure  8.  Plots  of  estimated  power  versus  observed  significance  for  each  value  of  A  when  10%  of  the  data  is  missing.  Symbol  used  is  I 
for  INDICATOR  algorithm,  F  for  FULL,  and  C  for  COMMON.  The  solid  line  corresponds  to  P2  =  0.9,  the  dashed  line  to  P2  =  0.7, 
and  the  dotted  line  to  P2  =  0.5. 


Figure  9.  Power  estimates  when  no  data  is  missing  for  each  of  the  three  algorithms  when 
data  have  two  binary  and  two  continuous  variates,  possibly  dependent.  For  population  7t  j, 
each  possible  combination  of  the  binary  part  occurs  with  probability  1/4.  The  condition^ 
distribution  of  the  continuous  part  is  MVN(0,  Sj),  where  Zj  is  a  2x2  matrix  with  diagonal 
elements  of  one  and  off-diagonal  elements  of  0.5,  within  each  multinomial  cell.  For 
population  7:2,  the  conditional  covariance  matrix  for  the  continuous  part  is  Z2,  where  Z2 
has  ones  on  the  diagonal  and  off-diagonal  elements  of  -0.5.  Several  distributions  for  the 
discrete  part,  and  several  choices  of  mean  vectors  are  used,  as  defined  in  Table  2. 
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(c)  MCODE  =  3  (d)  MCODE  =  4 

Figure  10.  Plots  of  estimated  power  versus  observed  significance  for  each  mean  configuration  in  our  second  study  when  no  data  is 
missing.  Symbol  used  is  I  for  INDICATOR  algorithm,  F  for  FULL,  and  C  for  COMMON.  The  solid  line  corresponds  to  PCODE  =  1, 
the  dashed  line  to  PCODE  =  2,  and  the  dotted  line  to  PCODE  =  3 . 


Figure  11.  Power  estimates  when  approximately  10%  of  the  data  is  missing  for  each  of 
the  three  algorithms  when  data  have  two  binary  and  two  continuous  variates,  possibly 
dependent.  For  population  Jt^,  each  possible  combination  of  the  binary  part  occurs  with 
probability  1/4.  The  conditional  distribution  of  the  continuous  part  is  MVN(0,  2j),  where 
is  a  2x2  matrix  with  diagonal  elements  of  one  and  oflf-diagonal  elements  of  0.5,  vdthin 
each  multinomial  cell.  For  population  1Z2,  the  conditional  covariance  matrix  for  the 
continuous  part  is  S2,  where  I2  diagonal  and  off-diagonal  elements  of 

-0.5.  Several  distributions  for  the  discrete  part,  and  several  choices  of  mean  vectors  are 
used,  as  defined  in  Table  2. 
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(c)  MC0DE  =  3  (d)  MCODE  =  4 

Figure  12.  Plots  of  estimated  power  versus  observed  significance  for  each  mean  configuration  in  our  second  study  when  10%  of  the 
data  is  missing.  Symbol  used  is  I  for  INDICATOR  algorithm,  F  for  FULL,  and  C  for  COMMON.  The  solid  line  corresponds  to 
PCODE  =  1,  the  dashed  line  to  PCODE  =  2,  and  the  dotted  line  to  PCODE  =  3. 


Outlier  Tests  with  Multiple  Stations 
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Southern  Methodist  University 

August  25,  1995 


Abstract 

Some  techniques  are  discussed  for  dealing  with  the  problem  of  dis¬ 
tinguishing  between  earthquakes  and  explosions  when  data  are  avail¬ 
able  at  more  than  one  station.  A  simulation  study,  in  which  the 
performance  of  these  techniques  are  compared,  is  presented. 
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1  Introduction 

We  consider  the  problem  of  observing  seismic  events  for  the  purpose  of  dis¬ 
tinguishing  between  earthquakes  and  explosions.  Baek,  et  al.  (1994)  treat 
this  as  an  outlier  problem  which  is  to  determine  whether  a  new  and  possi¬ 
bly  a  suspicious  event  should  be  classified  as  an  earthquake,  given  a  training 
sample  of  data  on  earthquakes.  In  that  paper,  it  was  assumed  that  data  on 
several  variables  were  available  at  a  given  station.  It  was  also  assumed  that 
the  variables  might  be  either  continuous,  discrete  or  a  mixture  of  both  types. 

In  this  report,  we  address  the  issue  of  outlier  testing  when  data  are  at 
more  than  one  station.  In  particular,  we  suppose  that  there  are  p  feature 
variables  observed  at  each  of  m  stations.  A  fundamental  problem  is  how  to 
utilize  the  information  from  multiple  stations  in  a  test  for  outliers. 


1.1  Test  1:  Full  Vector  Approach 


One  technique  for  outlier  detection  in  the  multi-station  case  is  to  consider 
the  p  features  at  each  of  xn  stations  as  a  single  vector  consisting  of  mp 
variables.  That  is,  the  observation  vector  for  the  i  th  event  in  the  training 
sample  is  an  mpxl  vector  of  the  form 


Xj  =  (A’lii,  A'i2»>  •  •  •  A2li.  A'22i;  •  •  •  - ,  Apii,  Ap2M  •  •  •  :  Apmi )  ; 

i  =  l,...,u.  where  Xjki  indicates  the  ;  th  feature  measured  at  the  fcth 
station  for  the  i  th  earthquake.  A  new  observation  to  be  tested  as  an  outliei 
has  then  a  similarly  configured  mpxl  vector  of  the  form 


Xn+i  =  (Aii.n+1,  Ai2.n+l! - A'im,n+'l^'  •  •  •  Apl.n+1? -^p2,n+l?  •  •  •  •  Ap^.n+l  )  • 

We  consider  the  training  sample  {X,  }-Li  to  be  from  the  density  function 
/(.;/ii,S),  where 

/(X;  A^i,  S)  =  (2  vr)-^^  exp  |  -  Mi)'  S"'  (X  -  Mi)|  , 

i.e.,  we  are  assuming  in  this  report  that  the  feature  variables  have  a  mul¬ 
tivariate  normal  distribution.  Similarly,  the  new  X„+i  is  assumed  to  have 
probability  density  /(.;  Baek,  et  al.  (1994)  classify  X^+i  b\  testing 

the  hypotheses 


ffo  :  Ml  =  M2 
•^1  •  Ml  ^  M2' 


The  likelihood  of  Xi,  X2, . . . , Xn, X„+i  is  given  by 

L(0;Xi,...,Xn+i)  =  T(0;Xi,...,Xn)  (27r)-'^lSl-2 

exp  I  -  -  (X„+i  -  ^2)'^”^  (Xn+i  -  M2) I  ’ 
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where  X(0;  Xi, . . .  ,Xn)  =  HLi  /(X,;/:ti,S),  the  likelihood  of  Xi,X2, . . .  ,X„, 
and  0  =  (/ii,/X2,S).  The  generalized  likelihood  ratio  is  therefore  defined  by 

^  X(0;  Xi, . . .  ,Xn+i) 

T(0,  Xi, . . . , Xji-i-i) 

(1) 

_  T(^o;  Xi,  ■  ■  ■  ,Xn.|-i) 

T(0;Xi,...,X„+i)  ’ 

where  is  the  maximum  likelihood  estimate  (MLE)  of  0  under  the  restric¬ 
tion  that  Ho  is  true,  and  9  =  where  and  S  are  the  MLE's  of 

and  S  based  on  Xi,X2,...,X„  and  /tij  =  X^+i.  It  intuitively  follows 
that  small  values  of  A  provide  evidence  against  Hq^  and  thus  the  generalized 
likelihood  ratio  test  is 


reject  Hq  if  A  <  A(a)  ,  (2) 

where  A(q;)  is  chosen  to  provide  a  size  a  test.  In  the  setting  considered  by 
Baek,  et  al.  (1994),  i.e.  the  data  are  a  mixture  of  continuous  and  discrete 
types,  the  distribution  of  A  is  unknown:  and  bootstrap  techniques  were 
applied  in  order  to  ascertain  this  distribution.  Baek,  et  al.  (1994)  point  out 
that  when  all  classification  variables  are  continuous  and  have  a  multivariate 
normal  distribution,  the  setting  considered  here,  the  distribution  of  A  is 
known;  and  the  critical  values  can  be  Found  based  on  the  F-distribution.  In 
particular,  in  the  current  case  of  mp  variables  and  a  training  sample  of  size 
?r,  A(a)  is  given  by 


A(a) 


TU  p  Fa 

n  —  mp 


(3) 


where  Fa  is  the  (1  —  a)  th  percentile  of  the  F-distribution  with  mp  and 
n  —  mp  degrees  of  freedom.  In  the  multivariate  normal  ceise,  calculating  the 
critical  value  A(q:)  by  using  the  bootstrap  procedure  or  from  (3)  produced 
very  similar  results^  More  details  concerning  the  derivation  of  (3)  are  given 
in  Fisk  (1995). 

In  the  full  vector  approach,  no  attempt  is  made  to  account  for  the  fact 
that  the  same  p  variables  are  being  measured  at  the  m  stations.  It  should 
be  noted  that  this  solution  strategy  actually  does  not  require  the  same  p 
variables  to  be  observed  at  each  of  the  m  stations. 


1.2  Test  2:  Minimum  Variance  Weighting 

A  second  method  that  will  be  examined  here  is  the  combining  of  features 
across  stations  by  using  minimum  variance  weighting.  This  procedure  is 

^  These  simulations  results  will  not  be  given  here. 
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designed  to  reduce  the  dimensionality  of  the  problem  by  taking  advantage 
of  the  correlation  structure  between  stations.  We  construct  a  new  feature 
Yj  associated  with  feature  j.  The  new  feature  is  the  linear  combination  of 
feature  j  at  each  of  the  m  stations,  i.e., 


Yi  =  Y,  "i-.  .  W 

k:=l 

which  minimizes  the  variance  of  Vj  subject  to  the  constraint  that  the  weights 
sum  to  one.  Theoretically,  the  weights  are  given  by 


(5) 


where  /3  =  (1, 1, . . . ,  1)'  and  Sj  is  the  covariance  matrix  of  Aji,  .  .  .  ,  Xjm- 
In  practice,  Sj  will  not  be  known,  and  will  be  estimated  by  the  usual  sample 
covariance  matrix  based  on  events  i  =  1, . . .  ,n.  This  weighting  is  not  based 
on  the  assumption  that  the  means  for  the  j  th  feature  are  constant  across  sta¬ 
tions,  but  rather  is  combining  the  data  across  stations  to  create  a  new  feature. 
This  procedure  creates  a  new  p-dimensional  vector  Y;  =  (lii, . . . , 

i  =  1 _ ,n.  The  (n  -f  I  jst  event,  which  is  to  be  classified  as  a  possible 

outlier,  is  weighted  by  using  the  same  weights,  i.e., 


Ij.n+l  —  ^jk  ^jk,n+l  •  (®) 

k=l 

This  weighting  reduces  the  dimension  from  mp  variables  to  p  variables.  The 
outlier  detection  is  then  based  on  a  likelihood  ratio  as  before  but  calculated 
using  only  the  p  new  variables.  It  should  be  noted  that  although  the  weights 
are  stochastic  and  depend  on  the  data,  for  feature  j  the  same  m  station 
weights,  .  k  =  are  used  for  each  of  the  events.  Thus,  the 

resulting  new  p  features  will  be  approximately  normally  distributed  random 
variables. 


1.3  Test  3:  Separate  Tests  Based  on  Each  Station 

It  is  possible  for  the  test  based  on  all  stations  to  fail  to  declare  an  event  to 
be  an  outlier  while  the  test  based  on  one  of  the  individual  stations  finds  the 
event  to  be  an  outlier.  It  seems  plausible  that  a  very  noisy  station  could  result 
in  the  multi-station  test  losing  power  compared  to  that  associated  with  an 
individual  “good”  station^.  The  question  is  whether  multi-station  tests  are 
more  powerful  than  simple  use  of  individual  station-based  tests.  An  obvious 
strategy  for  using  station  information  at  m  stations  is  to  declare  an  event 

2The  identification  of  “good”  stations  for  use  in  the  test  is  under  investigation  and  will 
be  the  subject  of  the  future  report. 
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to  be  an  outlier  if  any  of  the  individual  station-based  tests  finds  the  event 
to  be  an  outlier.  This  has  the  apparent  advantage  that  if,  for  example,  there 
are  two  stations  with  one  of  the  stations  quite  noisy,  then  the  test  has  not 
been  penalized  for  inclusion  of  the  noisy  station,  as  may  be  the  case  with  the 
other  two  tests  considered.  However,  if  each  of  the  tests  at  the  Tn  stations  is 
run  at  the  a  =  0.05  level,  the  result  of  such  a  procedure  will  be  a  test  with 
significance  level  larger  than  ct,  and  according  to  Bonferroni  s  inequality,  the 
overall  significance  level  is  less  than  or  equal  to  ttioc  .  Thus,  in  order  to  assure 
that  the  overall  significance  level  is  no  more  than  a,  each  individual  station- 
based  test  should  be  run  at  level  a/m.  In  the  example  of  two  stations,  one 
good  and  one  noisy,  this  procedure  also  provides  a  penalty  for  incorporating 
the  noisy  station  since  it  requires  implementation  of  a  smaller  significance 
level,  a  =  0.025,  for  each  test.  Thus,  if  the  second  station  is  of  no  usefulness, 
then  the  result  of  the  procedure  is  to  reduce  the  power  over  that  of  identifying 
the  useful  station  and  appljdng  the  test  using  only  that  station.  Note  that 
the  weighting  done  here  is  optimal  (in  the  estimation  sense)  for  the  case  in 
which  the  variables  at  different  stations  are  independent,  which  is  not  the 
case  here. 

In  the  following  section,  we  present  some  simulation  results  in  which  we 
compare  the  power  of  the  three  outlier  testing  procedures  described  here  in 
order  to  examine  the  conditions  for  which  a  particular  test  is  favored  over 
the  others. 


2  Simulation  Examination  of  the  Tests 

2.1  Two  Stations  and  One  Variable 

In  this  subsection,  we  consider  the  2— dimensional  case  of  two  stations  and 
one  variable  measured  at  each  station.  For  the  population  of  earthquakes, 
we  assume  that 


Xi  =  (Xixi,  A'i20'  ~  MVN{fx,,i:) ,  (7) 


where 

fii  =  with  /ill  =  /ii2  =  0 


and 


for  p  =  -0.25, 0.0, 0.25, 0.50, 0.75  . 


In  Table  1,  we  present  estimated  power  of  the  full  vector,  minimum  vari¬ 
ance  weighting,  and  separate  station— based  tests  when  the  potential  outlier 
is  from 


~  (8) 
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where  1^12)'-  ^  sample  of  size  n  =  50  is  generated 

from  the  earthquake  distribution,  i.e.,  MVN{{0,0y ,1,),  and  a  single  outlier 
is  generated  from  the  outlier  distribution,  i.e.,  S)  for  a 

variety  of  values  of  ix[f  and  The  test  determines  whether  this  single 

observation  is  classified  as  an  outlier.  The  entire  procedure  is  repeated  1000 
times,  and  the  power  estimates  given  in  the  table  are  the  proportion  of  times 
that  the  single  observation  from  was  called  as  an 

outlier.  All  tests  presented  in  this  report  were  run  at  the  a  =  0.05  overall 
significance  level. 

We  first  focus  on  the  full  vector  test.  Test  1.  There  we  can  see  in  general, 
as  and  become  further  removed  from  the  null  values  of  0,  the  power 

of  Test  1  increases  as  would  be  expected.  However,  some  of  the  results  in  the 
table  may  seem  nonintuitive  at  first  glance.  For  example,  when  p  =  0.75  the 
power  associated  with  the  alternative  =  2  and  =  0  is  0.738  which 
is  much  higher  than  for  the  case  p  =  0  in  which  the  power  is  0.377.  In  some 
respects  this  seems  to  be  an  unusual  result  since  intiutively  it  would  seem  that 
highly  correlated  variables  (stations)  would  tend  to  be  providing  redundant 
information  and  hence  might  be  expected  to  yield  lower  power  than  in  the 
case  in  which  the  correlation  is  smaller.  However,  it  should  be  pointed  out 
that  while  increased  correlation  reduces  information  in  estimation,  it  may 
dramatically  increase  information  for  purposes  of  outlier  detection.  Thus, 
it  is  important  to  note  that  the  shape  of  the  bivariate  distribution  plays 
a  major  role  in  determining  this  power,  i.e.,  in  determining  what  types  of 
values  appear  to  be  outliers.  In  Figure  1,  we  show  contour  plots  of  the 
bivariate  distributions  assumed  under  the  null  hypothesis  for  the  values  of 
p  considered.  There  it  can  be  seen  that  observations  around  =  2  and 
=  0  for  station  1  and  station  2.  respectively,  are  much  more  unlikely 
when  p  =  0.75  than  for  lower  values  of  p.  Interpretation  of  other  powers 
shown  in  Table  1  is  aided  by  examination  of  Figure  1. 

It  can  be  seen  from  Table  1  that  the  minimum  variance  weighting  test, 
Test  2,  results  are  sometimes  comparable  and  in  some  instances  superior 
to  those  obtained  using  Test  1,  the  full  vector  approach.  However,  it  is 
also  noted  that  in  some  cases  these  powers  can  be  much  worse  than  those 
obtained  using  Test  1.  In  order  to  understand  this  phenomenon,  consider 
again  the  case  in  which  ^  ~  0  with  p  =  0.75.  The  effect 

of  the  minimum  variance  weighting  is  to  produce  new  feature  Ti  calculated 
as  Yxi  =  oJnXiu  +^12X12^,  i  =  l,...,n.  In  this  case,  the  weights  will 
both  be  approximately  equal  so  that  the  mean  of  Yi  will  be  about  1,  and 
in  Table  1  it  is  seen  that  the  power  of  Test  2  is  only  0.185  when  p  =  0.75 
as  compared  to  0.738  using  Test  1.  From  Table  1  we  see  that  for 
relatively  close  to  ^42^  Test  2  tends  to  have  higher  power  than  Test  1.  It  is 
clear  from  Table  1  that  if  Test  2  care  must  be  taken.  In  the  case  of  positive 
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correlations  between  stations  Test  2  will  tend  to  perform  poorly  when  the 
potential  outliers  are  not  consistent  with  this  correlation  structure.  The  fact 
that  (2,0)'  is  not  consistent  with  the  correlation  structure  can  be  measured 
using  the  Mahalanobis  distance,  defined  as 

between  the  null  and  the  alternative  populations.  Thus,  although  the  points 
(2, 0)'  and  (\/2,  V^)'  have  the  same  squared  Euclidean  distance,  4,  from  the 
null  mean  of  (0, 0)',  the  Mahalanobis  distance  for  (2, 0)'  is  4  times  as  large  as 
that  for  (\/2,  \/2)'  whenever  p  =  0.75.  Thus,  when  the  correlations  between 
stations  are  positive,  a  large  Mahalanobis  distance,  compared  to  the  range  of 
Mahalanobis  distances  possible  for  a  given  Euclidean  distance,  should  serve 
as  an  indication  that  Test  2  should  not  be  used.  The  procedure  we  used  in 
the  simulations  for  assessing  whether  minimum  variance  weighting  should  be 
used  in  a  situation  in  which  the  correlations  between  stations  are  positive  is 
given  below: 

1.  Calculate  the  Euclidean  distance  ds  —  (/^i  — 

the  Mahalanobis  distance  du  =  (Mi  —  .'Yn+i)' — A„+i)  between 
the  null  mean  and  the  potential  outlier,  (for  and  S,  use  sample 
values  calculated  from  the  training  sample). 

2.  Calculate  the  minimum,  and  first  quartile,  dfj^\  of  all  possible 

Mahalanobis  distances  associated  with  means  separated  by  a  Euclidean 
distance  ds- 

3.  Whenever  <  2  and  d,v/  <  d-^'\  minimum  variance  weight¬ 

ing  is  appropriate.  Othei’wise.  the  full  vector  approach  is  recommended. 

In  Table  1.  we  give  the  power  of  a  "combined  test".  Test  4,  which,  for  a 
given  training  sample  and  potential  outlier,  uses  Test  1  or  Test  2  as  indicated 
by  the  Mahalanobis  distance  criterion  mentioned  above.  The  Mahalanobis 
distance  criterion  is  specifically  designed  for  the  case  in  which  there  is  positive 
correlation.  It  can  be  seen  that  in  cases  in  which  p  >  0.25,  Test  4  often 
performed  better  than  the  other  three  tests  and  always  had  close  to  the 
highest  power.  Test  4  did  not  perform  as  well  for  p  =  0.  However,  this  test 
is  designed  for  the  case  in  which  the  correlation  is  positive,  and  the  results 
for  p  =  Q  are  included  only  for  comparison.  It  should  be  noted  that,  in 
practice,  the  decision  concerning  whether  to  use  the  Mahalanobis  check  will 
be  based  on  the  correlations  calculated  from  the  training  sample  data;  and 
some  rules  should  be  obtained  for  deciding  how  large  a  sample  correlation 
should  be  before  the  Mahalanobis  check  is  performed.  It  should  be  noted 
that  Test  4  tends  to  have  slightly  larger  significance  level,  (i.e.,  power  at  the 
alternative  (0, 0)' )  than  the  nominal  level  of  a  =  0.05. 
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In  Table  1,  we  also  give  the  power  results  associated  with  Test  3.  In 
variable  at  each  of  two  stations,  the  associated  univariate  test  is  run  for  each 
of  the  two  stations  at  a  =  0.025,  and  an  outlier  is  said  to  be  detected  if 
either  of  the  two  univariate  tests  determines  the  event  to  be  an  outlier.  As 
indicated  earlier,  this  assures  that  the  overall  test  has  significance  level  no 
more  than  a  =  0.05.  In  the  table  we  see  that  the  power  is  competitive  with 
Tests  1  and  4  except  in  the  case  in  which  the  correlation  is  large  ( p  >  0.50) 
and  one  of  or  is  close  to  the  null  value  of  zero. 

In  Table  2,  we  present  simulation  results  for  a  case  in  which  the  station 
variances  are  not  equal.  In  particular,  we  consider  the  case  in  which 

S  =  (  ^  with  p=  -0.25,0.0,0.25,0.50,0.75  .  (10) 

Figure  2  shows  the  contour  plots  of  the  bivariate  distributions  assumed  under 
the  null  hypothesis  for  the  values  of  p  considered.  In  this  setting,  we  observe 
that  Test  1  and  Test  2  behave  similarly  in  the  sense  that,  in  general,  as 
and  ^12^  become  further  apart  from  the  null  values  of  0,  and  as  correlation 
between  stations  increases,  the  powers  of  these  tests  increase.  As  in  Table 
1.  while  Test  2  results  are  often  comparable  or  even  superior  to  Test  1.  for 
some  cases.  Test  2  results  are  much  worse  than  those  for  Test  1.  This  seems 
particularly  true  for  values  of  =  0  and  >  0  ''’hich,  as  can  be  seen 
in  Figure  2.  are  values  that  do  not  correspond  to  the  correlation  structure. 
Using  the  Mahalanobis  distance  criterion  as  in  Table  1  we  obtained  Test  4 
which  still  has  substantially  lower  power  than  Test  1  in  the  cases  where  Test 
2  power  is  much  less  than  Test  1.  As  in  Table  1  we  see  that  Test  1  has  power 
which  is  always  competitive  with  the  best  shown  in  the  table  whereas  each 
of  the  other  tests  can  in  some  instances  have  substantially  lower  power  than 
Test  1. 

\Ne  also  have  examined  the  use  of  outlier  tests  using  model  parameters 
obtained  from  actual  seismic  data.  Data  are  available  on  36  earthquakes  and 
70  explosions  for  the  logarithm  of  Pn/Lg(6-8Hz)  at  two  stations,  KNB  and 
MNV.  Letting  KNB  be  station  1  and  MNV  station  2,  the  sample  mean  vector 
and  covariance  matrix  for  this  training  sample  of  earthquakes  is  given  by 

Xi  =  (Xii,Ai2)^  =  (-0.743,-1.672)  and  Ei  =  ^  0.046  0.173  )  ’ 
while  the  corresponding  quantities  for  the  training  sample  of  explosions  are 
X2  =  (0.361, -0.010)'  and  S2  =  (^ 

The  estimated  correlation  between  stations  is  found  to  be  0.23  earthquakes; 
and  0.29  for  the  training  samples  of  explosions.  The  contour  plots  for  the 
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bivariate  normals  with  means  and  covariances  set  equal  to  those  observed 
for  the  two  training  samples  are  given  in  Figure  3.  There  it  can  be  seen 
that  there  is  a  substantial  separation  between  the  two  populations;  and  it 
would  be  expected  that  the  outlier  test  should  be  able  to  detect  explosions  as 
outliers  related  to  the  earthquake  population.  In  order  to  examine  this,  we 
performed  the  outlier  tests  using  samples  of  length  50  simulated  from  the 
model  with  =  Xi  and  Si  =  Si  with  outliers  simulated  from  the  bivariate 
normal  model  with  H2  =  X2  and  S2  =  S2.  In  this  setting,  an  outlier  was 
detected  95.3%,  95.6%,  97.5%  and  96.2%  of  the  time  by  using  the  separate 
station-based,  full  vector,  minimum  variance  weighting  and  combined  full 
vector-minimum  variance  weighting  tests,  respectively. 


2.2  Two  Stations  and  Two  Variables 

In  this  subsection,  we  briefly  consider  the  case  in  which  there  are  two  feature 
variables  measured  at  each  of  two  stations. 


To  compare  the  performances  of  Test  1,  Test  2,  Test  3  and  Test  4  of 
section  1  under  this  setting,  we  have  carried  out  some  simulations  on  data 
generated  from  multivariate  normal  distribution  with  various  mean  vectors 
and  covariance  matrices.  From  the  simulation  study,  we  have  observed  that, 
as  in  the  case  of  two  stations-one  variable,  the  estimated  powers  of  Test  2 
are  sometimes  comparable  and  in  some  cases  superior  to  those  of  Test  1;  and 
there  are  also  some  cases  in  which  powers  of  Test  2  are  much  worse  than 
those  of  Test  1.  The  combined  test,  Test  4,  performs  fairly  well,  but  it  is 
clear  that  the  Mahalanobis  decision  rule  on  page  8  may  not  be  optimal  for  a 
wide  range  of  parameter  configurations. 

We  also  have  examined  the  powers  of  the  outlier  tests  of  section  1  using 
model  parameters  obtained  from  actual  seismic  data.  In  the  simulations,  we 

assume  the  observations  Xj- =  (A'ni,  Aiof.  ^21;,  A22i)%  i  =  l - ,n  are  from 

MVN{fXi,T,).  Data  are  available  on  36  earthquakes  and  70  explosions  for 
the  logarithm  of  Pn/Lg  ratio  in  both  the  4-6Hz  and  6-8Hz  frequency  bands. 
We  let  station  1  denote  KNB  and  station  2  denote  MNV,  and  we  take  the 
log  Pn/Lg  ratio  in  the  6-8Hz  and  4-6Hz  frequency  bands  as  features  1  and  2, 
respectively.  The  sample  mean  vector  and  covariance  matrix  for  this  training 
sample  of  earthquakes  is  given  by 

Xi  =  (Xn,Xi2,X2i,X22y  =  (-0.712,-0.992,-1.657,-1.829)'  (13) 


and 
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while  the  corresponding  quantities  for  the  training  sample  of  explosions  are 


X2  = 

(0.362,- 

-0.155,- 

■0.032,  - 

-0.453)' 

/  0.197 

0.138 

0.054 

0.012 

tlo  — 

0.138 

0.195 

0.050 

-0.009 

^2  — 

0.054 

0.050 

0.180 

0.093 

0.012 

-0.009 

0.093 

0.207 

(14) 


Let  Pjj  denote  the  estimated  correlation  between  features  ji  and  ;2  at 
stations  'll  and  ^2.  These  quantities  for  the  training  sample  of  earthquakes 
are 


=  0.3  ,  p['f  =  0.2.3  ,  =  0.43  ,  pff  =  0.78  ,  =  0.19  .  (1.3) 


The  corresponding  quantities  for  the  training  sample  of  explosions  are 

=  -0.03 , =  0.28  ,  M!)'’  =  0-48  .  =  C-™  •  A'f  =  0-06  .(16) 


Figure  4  displays  the  contour  plots  for  the  bivariate  normals  with  means 
and  covariances  set  equal  to  those  observed  for  the  two  training  samples  for 
each  of  the  feature  variables  at  two  stations.  There  it  can  be  seen  that  for 
each  of  the  feature  variables,  there  is  a  reasonable  separation  between  the 
two  populations;  and  it  would  be  exi)ected  that  the  outlier  test  should  be 
able  to  detect  explosions  as  outliers  related  to  the  earthquake  population. 
In  order  to  examine  this,  we  performed  the  outlier  tests  using  samples  of 
length  50  simulated  from  the  model  with  =  Xi  and  Si  =  Sj  with 
outliers  simulated  from  the  multivariate  normal  model  with  fX2  =  X2  and 
S2  =  S2-  In  ftis  setting,  an  outlier  was  detected  81.1%,  89.1%,  94.1%  and 
93.9%  of  the  time  by  using  the  separate  station-based,  full  vector,  minimum 
variance  weighting  and  combined  full  vector-minimum  variance  weighting 
tests,  respectively.  We  have  also  examined  the  outlier  tests  of  section  1 
to  determine  their  success  rates  in  classifying  the  70  events  in  the  training 
sample  of  explosions  as  outliers.  The  success  rates^  for  Test  1,  Test  2,  Test 
3  and  the  combined  test  are  92.9%,  92.9%,  77.1%  and  91.4%  respectively. 
Test  1  and  Test  2  agreed  in  classifying  these  70  events  except  for  two  cases. 
In  one  of  these  two  cases  Test  1  does  not  classify  the  event  considered  as  an 
outlier,  and  Mahalanobis  distance  criterion  favors  Test  1.  In  the  second  case, 
Test  2  does  not  classify  the  event  considered  as  an  outlier,  and  Mahalanobis 
distance  criterion  favors  Test  2.  The  Mahalanobis  distance  criterion  favors 
Test  1  51.4%  of  the  time,  and  favors  Test  2  48.6%  of  the  time. 

^The  test  classifies  the  event  from  the  training  sample  of  explosions  as  outlier. 
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3  Tables  and  Figures 

•  Table  1.  Estimated  powers  of  the  outlier  tests  considered  under  the 
setting: 

^  M  with  /)  = -0.25,0.0,0.25,0.50.0.75 

P  1  / 

and  ,  Hi2  =  0,1,2, 3,4.  (The  values  in  the  parantheses  are  the 
number  of  times  that  full  vector  approach  and  minimum  variance  weight¬ 
ing  approach  performed,  respectively,  for  the  combined  test.) 

•  Table  2.  Estimated  powers  of  the  outlier  tests  considered  under  the 
setting: 


Ml 


^  ,  S  =  ^  with  p  =  -0.25,0.0,0.25,0.50,0.75 


and  p,[i^ ,  =  0.1,2, 3,4.  (The  values  in  the  parantheses  are  the 

number  of  times  that  full  vector  approach  and  minimum  variance  weight¬ 
ing  approach  performed,  respectively,  for  the  combined  test.) 


Figure  1.  Contour  plots  of  bivariate  normal  distributions  from 

,  E  3,  ^  1  Q  ,vith  p  =  -0.2.5,0.0.0.25.0.50,0.75  . 


M  = 


0 

0 


•  Figure  2.  Contour  plots  of  bivariate  normal  distributions  from 

0  A 


M  = 


0 


■2p  4 


with  p  =  -0.25.0.0.0.25,0.50.0.75  . 


•  Figure  3.  (Two  stations-one  variable  case)  Contour  plots  of  bivariate 
normal  distributions  with  means  and  covariances  calculated  from  actual 
seismic  data: 


Xi 


Xo 


-0.743  \ 

o  / 

'  0.235 

0.046 

-1.672  J 

5  —  ( 

^  0.046 

0.173 

0.361  \ 

^  / 

^  0.196 

0.0.53 

-0.010  ) 

>  ^2  =  ( 

.  0.053 

0.168 
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•  Figure  4.  (Two  stations-two  variables  case)  Contour  plots  of  bivari¬ 
ate  normal  distributions  of  each  variable  with  means  and  covariances 
calculated  from  actual  seismic  data: 

For  variable  1 ; 

-  f  -0.712  \  ^  _  f  0.242  0.052  \ 

~  V  ^  ”  V  0-052  0.177  ;  ’ 


^  _f  0.362  \  ^  _f  0.197  0.0.54  \ 

■^2  -  -0.032  j  Q  Q54  0.180  J  ' 

For  variable  2; 

^  _  f  -0.992  \  ^  _  f  0.275  0.075  \ 

^1-1^  -1.8-29  J  ’  “  V  0.075  0.227  )  ^ 


^  _(  -0.155  \  ^  _(  0.195  -0.009  \ 

^2-  y  _o_4.53  J  ^  ^2-  y  _o  009  0.207  J  ' 


Table  1.^ 


p  =  -0.25 

Test 

U) 

Mil 

- 05] — 

W2 

0 

1 

2 

3 

4  1 

0 

1 

0.044 

0.129 

0..389 

0.751 

0.949  1 

2 

0.054 

0.137 

0.347 

0.648 

0.877 

3 

0.041 

0.121 

0.367 

0.740 

0.945  1 

1 

1 

0.122 

0.250 

0.560 

0.856 

0.970  1 

2 

0.141 

0.348 

0.641 

0.884 

0.967 

3 

0.117 

0.197 

0.4.30 

0.769 

0.9-52 

2 

1 

0.400 

0.570 

0.793 

0.947 

0.989 

2 

0..360  1 

0.654 

0.891 

0.973 

0.996 

3 

0.390 

0.450 

0.6291 

0.857 

0.966  1 

3 

1 

0.749 

0.848 

0.943 

0.985 

0.999  1 

2 

0.650 

0.882 

0.974 

0.996 

1.000 

3 

0.745 

0.777 

0.863 

0.948 

0.986 

4 

1 

0.951 

0.975 

0.993 

1.000 

1.000 

2 

0.871 

0.970 

0.995 

1.000 

1.000 

3 

0.948 

0.959” 

0.983 

0.996 

1.000 

■^One  table  for  each  p  considered. 
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Table  1.  continues: 


0.999 

(76.3,2.37) 


Table  1.  continues: 


p  =  0.25  Test 


0 


0. 


0.056 


0.045 


2 

3 

0.105  0.245 


0.128  0.378 


0.157  0.415 

(717,283)  (764,236) 


^22  I  0.171  0.389 

0.106  0.238  0.436 

0.117  0.186  0.406 

0.145  0.226  0.453 

(735,265)  (577,423)  (535,465) 


.397 


0.691 


OOT  0.454  0.671 

(789,211)  (554.446)  |  (398,602) 

0.749  I  0.723  0.793 


0.452  0.674  0.668 


0.745  0.800 


0.934 

0.943 

0.857 

0.9-52 

0.9.39 

0.950 

0.9.38 

0.934 

0.959 

(851, 149) 

(700,300) 

(492,508) 

3 

0.756 

0.957 

0.4-58 

0.678 

0.750 

0.950 

0.748 
(824, 176) 

0.939 
(8-55, 145) 

0.727 

0.936 

0.690 

0.845 

0.748 

0.945 

0.765 

(-595,405) 

0.927 

(660,340) 

0.802 

0.949 

0.868 

0.953 

0.805 

0.951 

0.864 

(391.609) 

0.963 

(464,-536) 

0.906 

0.970 

0.9-58 

0.984  1 

0.900 

0.968 

0.941 

(294.706) 

0.983 

(313.687) 

0.974 

0.989 

0.983 

0.997 

0.970 

0.986 

0.982 

(311.689) 

0.996 
(256, 744) 
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Test 

- — 

fj'll 

0 

1 

2 

3 

4 

1 

0.044 

0.161 

0.481 

0.865 

0.992 

2 

0.056 

0.098 

0.213 

0.410 

0.588 

3 

0.046 

0.128 

0.379 

0.752 

0.954 

0.183 

(683,317) 

0.501 
(825, 175) 

0.863 

(903,97) 

0.987 

(936,64) 

0.176 
(698, 302) 


0.489 


0.222 
(464, 536) 


0.371 


0.388 


0.391 


0.377 


0.378 


0.393 


0.446 

(466,534) 


0.474 


0.604 


0.518 


0.742 


0.603 


0.741 


0.772 

(620,380) 


0.961 


0.783 


0.946 


0.953 

(717,283) 


0.733 

0.936 

0.798 

0.909 

0.438 

0.605 

(458.512) 

(231,769) 

1 

0.842 

2 

0.404 

3 

0.737 

4 

0.838 
(892. 108) 

0.736 


0.594 


0.740 


0.794 


0.758  0.822 

(623.377)  (258.742) 


0.775 


0.822 

(251,749) 

0.851 


0.929 


.873 


0.923 

(96,904) 


0.947 


0.948 

(407,593) 

0.954 


.  0.969 


0.962 


0.972 

(136.864) 


1 

0.982 

0.951 

0.935 

0.950 

0.976 

\mmmm 

0.783 

0.917 

0.969 

0.990 

0.940 

0.935 

0.938 

0.957 

0.978 

4 

0.980 

(931,69) 

0.946 

(752,248) 

0.951 

(413,587) 

0.967 

(136,864) 

0.990 

(51,949) 

Table  1.  continues 


1 

2 

3 

4 

0.244 

0.738 

0.989 

1.000 

0.094 

0.185 

0.370 

0.537 

0.127 

0.378 

0.753 

0.956 

0.730 

0.999 

(705.295) 

(882,118) 

(956,44) 

(978,22) 

0.135 

0.425 

0.880 

0.999 

O.ISS 

0.348 

0.545 

0.701 

0.154 

0.370 

0.740 

0.946 

0.204 

0.485 

0.865 

0.994 

(364,636) 

(459,541) 

(683,317) 

(843, 157) 

0.429 

0.413 

0.729 

0.33S 

0.538 

0.741 

0.370 

0.473 

0.749 

■SHI 

0.461 

0.544 

0.804 

0.961 

(466, 534) 

(130,870) 

(224, 776) 

(420,580) 

0.855 

I1KHM 

0.874 

iHom 

0.789 

0.876 

0.964 

(212.788) 

(31.969) 

(69,931) 

0.992 

0.952 

0.939 

■HI 

0.693 

0.862 

0.948 

0.980 

0.947 

0.947 

0.972 

0.945 

0.953 

(430,570) 

(67,933) 

(7,993) 

87 


Table  2.^ 


/-if’ 

IHSHB 

1 

2 

3 

4 

0 

1 

0.044 

0.129 

0.389 

0.751 

0.949 

2 

0.056 

0.166 

0.436 

0.785 

0.953 

3 

0.041 

0.121 

0.367 

0.740 

0.945 

1 

1 

0.061 

0.171 

0.447 

0.802 

0.961 

2 

0.070 

0.238 

0.569 

0.868 

0.971 

3 

0.053 

0.135 

0.383 

0.750 

0.947 

2 

1 

0.122 

0.250 

0.560 

0.8.56 

0.970 

2 

0.106 

0.328 

0.691 

0.926 

0.984 

3 

0.1171 

0.197  1 

0.430 

0.769 

0.9.52 

3 

1 

0.2.50 

imi 

2 

0.169 

0.442 

0.793 

0.958 

0.990 

3 

0.233 

4 

1 

0.400 

0..570 

0.793 

0.947 

■nciSBii 

2 

0.245 

0..563 

0.868 

0.972 

0.997 

3 

0..390' 

0.4-50 

0.629 

0.857 

0.966 

®One  table  for  each  p  considered. 
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Table  2.  continues: 


2 


Xa) 

Hi 

1 

2 

3 

4 

0.122 

0.377 

0.723 

KE^il 

0.154 

0.400 

0.739 

Kl^il 

0.128 

0.376 

0.748 

0.121 

0.369 

0.715 

(684.316) 

(886, 114) 

(966.34) 

0.142 

0.386 

0.7.31 

0.937 

0.203 

0.487 

0.137 

0.381 

HfiBI 

■MMI 

0.141 

0.391 

0.733 

0.937 

(666,334) 

(859, 141) 

(967, 33) 

(997,3)  1 

0.203 

0.446 

0.773 

0.946  1 

0.868 

0.967 

0.758 

0.947 

0.197 

0.445 

0.777 

0.946 

(•558. 442) 

(788,212) 

(944,56) 

(992,8)  1 

0.306 

0..549 

0.824 

0.960  1 

0.326 

0.666 

0.912 

0.977  I 

0.294 

0.494 

0.951  1 

0.264 

0.551 

0.836 

0.962  1 

(447, 553) 

(714,286) 

(893, 107) 

(973,27)  1 

mmm 

0.967  ^ 

0.409 

0.741 

0.938 

0.441 

0.605 

0.838 

0.366 

0.657 

0.879 

0.969 

(332,668) 

(594,406) 

(819, 181) 

(945,55) 

89 


Table  2.  continues: 


0 


0.044 


0.140 


0.399 

0.756 

0.418 

0.760 

0.378 

0.750 

0.130 


0.399 


(680,320)  (878,122) 


0.053 


0.061 


0.044 

(534,466) 


0.122 

0.061 

0.117 

0.067 

(462,538) 


0.129 


0.180 


0.136 

(611,389) 


3 


0.186 


0.155 

(475.525) 


0.377 


0.464 


0.386 

(800,200) 


0..389 


0.517 


0.406 


0.405 

(711,289) 


0.753 

(974,26) 


0.728 


0.793 


0.735 

(954.46) 


0.727 


0.828 


0.748 


0.741 

(886,114) 


0.943 

(990,10) 


0.937 

(974,26) 


0 


0.094 


0.383 

0.414 

0.562 

0.805 

IBHIi 

0.278 

(218,782) 

0.552 

(386,614) 

0.815 

(646,354) 

Table  2.  continues: 


p  =  0.50 

Test 

msm 

— 1 

1 

0 

1 

0.044 

0.161 

2 

0.055 

0.166 

0.128 

0.155 

(656.344) 

1 

0.122 

0.162 

0.135 

4 

0.049 

(492,508) 

0.126  ^ 
(526,474) 

2 

1 

0.149 

0.151 

2 

0.050 

0.166 

3 

0.123  1 

0.180 

4 

0.138 

(390.610) 

3 

1 

0.294 

0.232 

0.169 

0.219 

0.258 

0.154 
(258. 742) 

4 

1 

0.489 

2 

0.057 

■BOH 

3 

0.374 

0.391 

■ 

llli^EHi 

IBBISHI 
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4 


Mil 


2 

3 

4 

0.481 

0.865 

0.828 

mmm 

0.752 

0.486 

(889,111) 

0.864 

(979,21) 

0.992 

(998,2) 

0.407 

0.793 

0.983 

0.492 

0.831 

0.971 

0.374 

0.747 

0.9-50 

0.431 

(784,216) 

0.800 

(932,68) 

0.983 

(987.13) 

0.377 

0.742 

0.961 

0.497 

0.839 

0.972 

0.393 

0.741 

0.946 

0.410 

(602,398) 

0.769 
(833, 167) 

0.963 

(947,-53) 

0.397 

0.723 

0.494 

0.837 

0.439 

0.7-53 

0.474 

0.733 

0.491 

0.828 

0.518 

0.775 

mmsm 

0.454 

(233,767) 

0.770 

(472,528) 

0.951 

(707,293) 

1 

0.238 

i 

2 

0.087 

0.244 

0.738 

0.989 

0.263 

0.734 

0.977 

0.127 

0.378 

0.7.53 

0.244 
(725, 275) 

0.736 

(951,49) 

0.990 

(991.9) 

0.144 

0.559 

0.9-52 

0.188 

0.643 

0.9.58 

0.125  n 

0.369 

■llHI 

0.146 

(.523.477) 

0.578 

(839,161) 

0.954 

(966,34) 

0.135 

0.425 

0.880 

0.128 

0.548 

0.926 

0.154 

0.370 

0.740 

4 


1.000 


1.000 


0.956 


1.000 

(993,7 


0.477 

0.145 

0.219 

0.334 

(644.356) 


0.728 


,11 

656) 


0.229 

0.089 

0.239 

0.102 

(274,726) 


0.459 

(6.36,364) 


0.377 

0.438 

0.402 

0.377 

(358,642) 


(876. 124) 


0.793 

0.S83 

0.740 

0.826 

(688.312) 


0.729 


2 

0.205 

3 

0.365 

0.999 

(969,3: 


0.991 

0.996 

0.945 

0.991 

(898.102) 


0.979 

0.991 


0.945 


Figure  1 .  Contour  plot  of  Bivariate  Normal  Density  Function: 


Figure.2.  Contour  plot  of  Bivariate  Normal  Density  Function: 
mu=(0,0);  s1=1,  s2=4,  rho=-0. 25, 0.0, 0.25, 0.5, 0.75. 


4  Concluding  Remarks 

We  considered  the  problem  of  observing  seismic  events  for  the  purpose  of 
distinguishing  between  earthquakes  and  explosions,  ^^''e  studied  the  case  in 
which  data  are  available  at  more  than  one  station.  We  discussed  three  outlier 
tests,  all  based  on  likelihood  ratio,  for  the  cases  in  which  there  are  p  feature 
variables  at  each  of  m  stations.  The  tests  utilize  the  data  in  different  forms: 
Test  1  treats  p  variables  at  m  stations  as  mp  variables,  Test  2  combines 
the  information  for  each  of  p  variables  at  tti  stations,  Test  3  treats  each 
station  separately.  We  also  discussed  a  combined  test,  Test  4,  which  decides 
to  use  Test  1  or  Test  2  based  on  Mahalanobis  distance  criterion  described  in 
section  1. 

It  seems  that  Test  1  gives  powers  that  are  sometimes  best  and  always 
comparable  to  the  best  powers,  while  for  the  other  tests,  scenarios  existed  in 
which  the  power  was  substantially  lower  than  that  for  Test  1.  Thus,  unless  m 
and  p  are  sufficiently  large  to  make  Test  1,  the  full  vector  test,  impractical, 
we  recommend  its  use. 
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A  New  Test  for  Outlier  Detection  from 
a  Multivariate  Mixture  Distribution 
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ABSTRACT 

The  problem  of  testing  an  outlier  from  a  multivariate  mixture  distribution  of  several 
populations  has  many  important  applications  in  practice.  One  particular  example  is  in  monitoring 
worldwide  nuclear  testing,  where  we  wish  to  detect  whether  an  observed  event  is  possibly  a 
nuclear  explosion  (an  outlier)  by  comparing  it  with  the  training  samples  from  mining  blasts  and 
The  combined  population  of  seismic  events  from  mining  blasts  and  earthquakes  can 
be  viewed  as  a  mixture  of  two  populations.  The  classical  likelihood  ratio  test  appears  to  be  not 
applicable  in  our  problem,  and  in  spite  of  the  importance  of  this  problem,  little  progress  has  been 
made  in  the  literature.  In  this  report  we  propose  a  simple  modified  likelihood  ratio  test  that 
overcomes  the  difficulties  in  the  current  problem.  Bootstrap  techmques  are  used  to  approximate 
the  distribution  of  the  test  statistic.  The  advantages  of  the  new  test  are  demonstrated  via 
simulation  studies.  Some  new  computational  findings  are  also  reported. 
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1.  Introduction 

An  extremely  important  practical  problem  is  that  of  monitoring  worldwide  nuclear  testing, 
where  we  wish  to  detect  whether  an  observed  seismic  event  may  be  a  nuclear  explosion  by 
comparing  it  with  the  training  samples  obtained  from  previous  seismic  activity  in  the  region.  In 
this  case,  the  training  data  will  often  be  composed  of  data  which  are  a  composite  of  mining 
explosions  and  earthquakes.  Usual  methods  of  outlier  detection  typically  focus  on  the  setting  in 
vdiich  observations  are  tested  as  outliers  from  a  single  population.  However,  in  the  case 
considered  here,  there  are  two  populations,  and  we  wish  to  test  whether  a  seismic  event  should  be 
considered  to  be  an  outlier  from  either  or  both  of  the  populations.  Actually,  these  results  are 
applicable  to  two  or  more  populations  but  we  focus  on  the  case  of  two.  Another  point  of  interest 
is  the  fact  that  the  setting  considered  here  differs  from  a  common  outlier  scenario  in  which  a 
sample  is  given  and  the  observations  from  the  sample  are  tested  to  determine  whether  they  should 
be  considered  as  outliers  from  the  population  from  which  the  sample  was  obtained.  This, 
however,  is  not  the  scenario  considered  here.  Specifically,  in  our  setting,  "pure"  samples  from  the 
populations  in  question  are  available,  and  our  desire  is  to  test  a  new  observation  as  an  outlier  from 
these  populations.  We  will  refer  to  this  testing  procedure  as  outlier  testing  throughout  the  report. 

The  classical  method  for  outlier  detection  of  the  type  we  are  addressing  is  the  likelihood 
ratio  test  O^tilks  (1963),  Caroni  and  Prescott  (1992)),  usually  under  the  normality  assumption  for 
the  multivariate  distributions  of  the  training  sample  population  and  the  outlier  population,  and 
under  the  assumption  of  equal  covariance  of  the  two  populations  under  the  alternative  hypothesis. 
The  resulting  test  is  essentially  the  Hotelling's  test  (see  Anderson  (1984)).  In  our  current 
problem,  because  of  the  fact  that  there  is  not  a  single  multivariate  normal  population  associated 
with  the  training  sample,  these  assumptions  are  not  satisfied.  Thus,  a  direct  application  of  the 
standard  likelihood  ratio  test  does  not  seem  possible.  In  spite  of  the  importance  of  this  problem, 
to  our  knowledge  little  progress  has  been  made  in  the  literature.  Baek  et  al.  (1992)  recently 
considered  the  outlier  testing  in  the  seismic  setting  discussed  here  but  in  the  special  case  in  which 
seismic  events  are  tested  as  outliers  from  a  single  population,  usually  earthquakes.  Baek  et  al. 
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(1992)  used  a  bootstr^^  approach  to  ascertain  the  distribution  of  the  likelihood  ratio  when  the 
multivariate  distribution  associated  with  the  training  sample  has  both  continuous  components  and 
discrete  components  that  have  a  finite  number  of  possible  outcomes.  Some  assumptions,  such  as 
equality  of  covariances,  are  imposed  to  link  the  training  sample  population  and  the  outlier 
population.  It  is  possible  to  apply  the  test  of  Baek  et  al.  sequentially  to  each  training  sample 
population,  but  this  can  be  cumbersome,  e.g.  the  training  sample  populations  often  have  different 
covariance  structures.  Furthermore,  this  procedure  would  result  in  substantial  loss  of  power. 

In  this  report  we  consider  an  approach  to  the  practical  problem  at  hand  by  considering 
the  combined  population  of  seismic  events  of  mining  blasts  and  earthquakes  as  a  mixture  of  two 
populations.  We  propose  a  simple  modified  likelihood  ratio  test  using  bootstrap  resampling  that 
appears  to  perform  well  in  this  setting.  The  methodology  is  presented  in  Section  2  for  testing 
outliers  fi'om  a  mixture  population  consisting  of  m  components.  Some  numerical  procedures  are 
addressed,  including  the  use  of  the  bootstrap  for  approximating  the  distribution  of  the  test  statistic 
in  Section  3.  We  also  describe  how  the  intensive  computing  time  required  for  the  bootstrap 
resampling  can  be  reduced  without  loss  of  accuracy  when  the  trmning  sample  size  is  relatively 
large.  Section  4  provides  the  results  of  empirical  studies.  Some  concluding  remarks  are  given  in 
Section  5. 

2.  The  Methodology 

Suppose  we  have  a  mixture  distribution  Hof  m  populations,  Ilj,  i  —  1,...,  m.  In  the 
nuclear  testing  example  mentioned  above,  m  =  2  for  mining  explosions  and  earthquakes.  Let  d 
be  the  dimension  of  the  variables  fi'om  the  mixed  population  11,  and  for  clarity  in  the  presentation 
assume  all  the  distributions  are  continuous.  Note  that  extensions  to  discrete  or  mixed  cases  are 
mainly  a  matter  of  notational  adjustments.  The  density  of  the  mixture  distribution  is 

m 

f{x;  = 

i=l 
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where  p,  >  0  are  mixing  proportions  with  ^  Pi  =  1,  9i  are  the  densities  of  Hi,  6i  are  unknown 

»=i 

parameter  vectors,  0  =  doadx  =  {xi,...,  x^)'.  In  the  nuclear  monitoring 

scenario,  we  wish  to  test  whether  a  new  seismic  event  is  an  outlier  to  the  mixture  of  earthquakes 
and  mining  explosions.  More  generally,  we  wish  to  be  able  to  test  whether  a  new  observation  is 
an  outlier  from  the  mixed  population  11 . 

Assume  that  we  have  a  random  training  sample  of  size  n  from  the  mbcture  population 

6  n, 

and  that  we  are  able  to  identify  the  associated  source  population  for  <  n  members  of  the 
trmning  sample.  For  convenience,  let 


Xk,_,+i,  Xki_,+2, -,Xk,  e  Hi ,  for  i  =  1,  (2) 

where  Q  =  ko  <  km  =  ul,  i.e.,ni  =  fej  -  fcj_i  (normally  >  10)  data  points  are 

identified  to  be  from  11,.  Additionally,  we  allow  for  the  possibility  that  the  training  sample 
contains  nu  unlabeled  observations  from  the  mixture.  In  the  notation  of  Redner  and  Walker 
(1984)  we  assume  the  sample  Xi,  ...,X^  is  of  Type  4,  i.e.  the  training  sample  consists  of  labeled 
and  unlabeled  observations.  The  associated  n^'s,  i  =  1,  ,  m  are  random  variables  following  a 

multinomial  distribution,  and  they  contain  information  about  the  mixing  proportions.  In  this 
notation,  n  =  nL+nu.  If  in  fact  nu  =  0,  then  the  training  sample  consists  of  only  labeled 
observations  and  is  a  sample  of  Type  3  using  the  Redner  and  Walker  notation.  Now  a  new 
observation  Xn+i  is  obtained.  Given  (2)  we  want  to  test  the  following  hypotheses: 

Hq  :  Xn+i  e  n 


(3) 


Hi  :  ^  n . 


The  classical  likelihood  ratio  test  statistic  is  the  ratio  of  the  majdmized  likelihood  functions 
under  Ho  and  Hi.  Under  Ho  the  sample  is  of  Redner  and  Walker  Type  4,  i.e.  we  assume  that 
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Xu...,X^  are  as  before  while  X^+i  is  unlabeled  but  from  the  same  mixture  distribution  as 
Xi,...,  JSC,.  That  is,  we  assume  that  all  n  +  1  observations  are  from  the  mixture  distribution 
assumed  under  Hq  with  ni  of  these  labeled  and  nc;  + 1  unlabeled.  The  likelihood  function  under 
Hois 

xo(9)=  n  n 

Let  h{x;  a)  be  the  density  associated  with  the  outlier  population  from  which  Xn+i  is  sampled, 
where  a  is  an  unknown  parameter  vector.  Then  the  likelihood  function  under  Hi  is 

L,ie.c)=  ,  fn  n  1 1  n 


DifBculties  arise  when  maximizing  Li  since  there  is  only  a  single  observation  from  the  outlier 
population  so  that  generally  no  suitable  MUE  is  possible  for  a,  unless  a  is  assumed  to  directly  hnk 
to  6.  Any  such  linkage  assumption  is  quite  questionable  since  we  now  have  m  individual 
populations  that  make  up  the  mixture  distribution.  Furthermore,  with  only  one  observation  it  is 
impossible  to  do  any  model  checking  of  h(x;  a).  To  overcome  these  difficulties  and  to  observe 
the  fact  that  little  information  is  known  about  the  outlier  population  from  which  Xj  is  sampled, 
we  simply  use  a  constant  density  h{x)  =  c  over  its  practical  (finite)  support.  Moreover,  the 
constant  density  is  also  assumed  in  the  bootstrap  procedure  described  below.  Thus,  dropping  the 
constant  from  the  likelihood  ratio  test  statistic  will  not  affect  any  test  conclusions.  Therefore  we 
let 


L  m  = 


ifn  n  f  n 

m!  ■■■"„!  Vfaij=iT+i  /  V.ir-i-1  J 


which  is  the  likelihood  based  on  the  sample  Xi,...,X^  from  the  mixture.  We  define  a  simple 
modified  likelihood  ratio  test  statistic 
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w  = 


(5) 


supXoW 
gee 

sup  L  i(d) 
g  e  © 

where  ©  is  the  entire  parameter  space.  It  is  easily  seen  that  the  departure  of  Xn+i  from  /  will 

reduce  supLo(^)  making  W  small.  Hence  the  rejection  region  is  of  the  form  W  <  Wq  for  some 
gee 

Wq  picked  to  provide  a  level  a  test.  Since  the  null  distribution  of  W  has  no  known  closed  form, 
we  suggest  the  use  of  the  parametric  bootstrap  method  to  approximate  it,  as  shown  in  the  next 
section.  Based  on  the  discussion  here  the  use  of  W  seems  to  be  a  reasonable  approach,  and  in 
Section  4  we  demonstrate  that  W  performs  well  imder  all  the  simulation  scenarios  considered. 

Concluding  this  section,  we  point  out  that  asymptotically  W  «  /(-^n+iJ  ^n),  as  n  — >  oo, 
where  is  the  MLE  using  the  training  sample  only.  See  the  Appendk  for  the  proof  Moreover, 
the  bootstrap-one  method  described  in  the  next  section  is  essentially  equivalent  to  using  this 
asymptotic  result. 

3.  The  Bootstrap  and  Other  Computational  Procedures 

In  this  section  we  discuss  numerical  issues  associated  with  the  test  procedure  described  in 
Section  3.  It  should  be  noted  that  often  both  the  numerator  and  denominator  of  W  in  (5)  may  be 
difficult  to  obt^  since  the  individual  densities  are  mbcture  distributions.  Recall  also  that  for  the 
numerator  we  assume  that  Xi,...,Xn  can  be  identified  with  their  component  population,  but 
Xn+i  is  only  known  to  be  from  the  mixture,  not  the  exact  component.  However,  if  we  consider 
the  setting  of  multivariate  normality  for  each  component,  i.e., 

9i(x;6i)  ^  N(ui,'Ei),  (6) 

and  thus  f{x;  9)h  a  mixture  of  m  multivariate  normal  distributions,  a  numerical  iteration 

algorithm  based  on  the  EM  algorithm  has  been  developed  by  Redner  and  Walker  (1984),  for 

maximizing  Lo{d).  They  extended  Hosmer's  (1973)  algorithm  for  the  case  of  two  univariate 

normal  components  to  the  multivariate  normal  components  setting,  and  in  our  simulation  studies, 
we  have  adapted  their  method.  Note  that  with  (6),  supL  i(0)  is  easily  obtained.  Using  the 
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resulting  estimator  as  an  initial  value  in  the  numerator,  it  only  takes  at  most  a  few  steps  to 
obtain  convergence. 

We  now  turn  to  bootstrapping  the  null  distribution  of  W.  We  will  employ  the  parametric 
bootstrap  based  on  the  training  sample  Xi,  ...,Xn-  The  following  algorithm  is  used  which 
mimics  the  original  sampling  plan. 

Step  1:  Use  (2)  to  obtain  (p^,  Ui,  E,)  for  i  =  l, m  . 

Step  2;  For  each  integer  b,b  =  l, B,  draw  a  sample  of  size  from  the  multinomial 

distribution  with  p=  (pj,  ....PtoX-  We  observe  the  frequencies  to  TOj^, 

and  TO^^  where  n/i  +  TOg^^  H - h  to  =  tojt  .  Additionally,  we  draw  a  sample 

of  size  nu  from  the  same  multinomial  distribution  resulting  in  frequencies  to 
....andTOj’j^whereTOjJ^  +  +  •••+to^  =nu. 

Step  3;  Draw  samples  of  size  n^^and  from  N{ui,  Ej)  for  i  =  1,  ...,TOt.  The  TO£ 

observations  associated  with  frequencies  to^^,  i  =  1,  m  are  treated  as  labeled 
samples  in  the  analysis,  while  the  nu  observations  corresponding  to  n^U’ 
i  =  l,  m  are  treated  as  unlabeled  observations.  These  resampled  data  are 
used  to  compute  the  test  statistic  in  (5).  This  test  statistic  is 
denoted  by  Wj. 

Step  4:  Draw  a  new,  (to  +  l)st,  observation  from  the  empirical  mixture  by  randomly 

selecting  a  single  observation  from  the  multinomial  distribution  in  Step  2.  This 
multinomial  will  essentially  select  a  component  i  between  1  and  m,  and  we 
generate  an  observation  from  the  associated  JV(to,-,  Ej)  distribution. 

Step  5:  Repeat  Steps  2  to  4  B  times  (6  =  1, B).  Then  define  Wa  to  be  the 
(100a)th  percentile  of  all  Wj  .  Specifically,  if  a  =  j/{B  + 1),  then  Wa  is 
the  ^"th  smallest  value  of  (se©  McLachlan,  1987).  Statistical  decisions 

can  then  be  made. 
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Notice  that  when  n  is  large  the  bootstrap  scheme  may  require  considerable  computing 
time.  However,  when  rn  are  not  very  small,  this  computational  burden  can  be  avoided  by 
employing  an  ^proximate  bootstrap  scheme,  called  bootstr^-one.  This  technique  uses  the 
original  training  sample  in  Steps  2  and  3  for  all  6  =  1, B.  It  effectively  eliminates  these  two 
steps  and  many  calculations  in  obtaining  Wa- 

The  bootstrap-one  method  conceptually  approximates  the  conditional  distribution  of  W 
given  Xi,...,Xn.  When  all  rii  are  relatively  large,  the  conditioning  effect  is  minimal.  The 
accuracy  and  advantages  of  the  bootstrap-one  method  are  among  the  things  studied  in  simulations 
which  are  discussed  in  the  next  section. 


4.  Empirical  Studies 

In  this  section  we  report  some  results  of  a  simulation  study  to  illustrate  the  performances 
of  the  new  methods.  In  these  simulations  we  focus  on  the  case  in  which  all  training  sample 
observations  are  labeled,  i.e.  nu  —  0. 

Example  1.  In  this  example,  we  choose  m  =  1,  d  =  2,  and  n  =  40  so  that  the  training  sample 
is  from  a  bivariate  N{p,T,),  where 

"“(o)’  l) 

were  used.  Obviously,  in  this  case  since  there  is  only  one  component  in  the  "mixture",  all 
observations  in  the  training  sample  can  be  labeled,  i.e.  nu  =  0.  The  reason  for  choosing  m  =  1  is 
that  in  this  case  it  is  easy  to  apply  the  standard  likelihood  ratio  test  assuming  that  the  outlier 
population  is  normal  with  the  same  covariance  E.  In  this  case,  there  is  a  single  training  sample  of 
size  n  and  an  observation  -X’^+jto  be  tested  as  an  outlier.  Baek  et  al.  (1992)  discusses  the 
generalized  likelihood  ratio  test  in  this  setting.  In  particular,  the  likelihood  ratio  statistic  is  given 
by 
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(8) 


sup  Lo{d) 

\  _  ^  €  0 _ 

sup  Li  {6y  a) 
fiGe 

where  Li  (6,  a)  is  given  in  (4)  and  a  is  related  to  0  in  a  certain  way. 

Specifically,  h  is  the  multivariate  normal  density  associated  with  observation  X„+i  and 
a  =  (/i2,S)  is  estimated  by  taking  p.2  =  X„+iand  taking  E  to  be  the  MLE  obtained  from  the 
training  sample  Under  the  normality  assumption  in  this  example,  the  test  statistic  in  (8)  is  known 
to  be  distributed  as  Hotelling's  (e.g.  Anderson,  1984).  Baek  et  al.  (1992)  considered  the 
likelihood  ratio  in  (8),  where  the  multivariate  random  variables  could  be  composed  of  both 
continuous  and  discrete  components.  They  approximated  the  distribution  of  A  in  this  case  using 
the  bootstrap  procedure  described  here.  They  applied  the  bootstrap  procedure  to  the  special  case 
in  which  the  distributions  were  multivariate  normal  and  approximated  the  distribution  of  A  using 
the  bootstrap  procedure.  Simulations  have  shown  that  the  power  of  the  test  based  on  the 
bootstrap  is  very  similar  to  that  obtmned  based  on  Hotelling's  T®  in  the  multivariate  normal  case. 
In  this  report  all  tests  are  based  on  the  use  of  bootstrap  resampling  to  approximate  the  distribution 
of  the  test  statistic.  The  test  based  on  (8)  will  be  called  the  "standard"  likelihood  ratio  test. 

Instead  of  including  Li  (6,  a)  in  the  denominator  of  (8)  in  this  multivariate  normal  setting, 
we  could  have  used  the  test  statistic  ^ven  in  (5)  which  is  based  on  the  use  of  a  constant  density 
h{x)  =  c  for  over  its  support.  The  test  statistic  using  (5)  will  be  termed  the  "modified"  likelihood 
ratio  test.  For  each  of  these  tests,  whenever  we  approximate  the  distribution  of  the  test  statistic 
by  a  fill!  bootstrapping  of  n  + 1  observations,  we  will  refer  to  this  as  the  "full"  procedure. 
Alternatively,  in  each  case  we  also  consider  the  use  of  the  bootstrap-one  technique.  In  Table  1  we 
denote  them  as  "full"  and  "one"  respectively. 

Table  1  summarizes  the  simulation  results  of  the  two  tests.  One  thousand  replications 
were  used  for  each  entry  and  we  used  B  =  499.  The  power  was  obtained  with  ^((2)* 
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as  the  outlier  population.  We  have  experimented  with  other  covariance 

values,  including  that  in  (7),  and  similar  power  patterns  were  observed. 

First,  we  compare  the  standard  and  modified  tests  using  full  bootstrapping.  In  Table  1,  it 
can  be  seen  that  the  significance  levels  for  both  tests  are  close  to  the  nominal  level  of  a  =  .05 
with  the  modified  tests  having  slightly  larger  levels.  Additionally,  the  powers  of  the  two  tests  are 
similar  with  the  modified  tests  having  somewhat  larger  power.  Thus,  the  use  of  W  in  (5),  which 
appropriately  reflects  our  ignorance  about  the  outlier  population,  performs  as  well  as  the  full 
likelihood  ratio. 

Next,  comparing  "One"  columns  to  "Full"  columns,  we  observe  that  the  bootstrap-one  has 
significance  levels  that  are  artificially  high  for  smaller  sample  sizes.  However,  for  large  n  (say  > 
100)  the  significance  levels  are  of  appropriate  size.  For  these  larger  sample  sizes  the  bootstrap- 
one  procedure  tended  to  have  higher  power  than  obtained  using  full  bootstrapping.  Based  on 
these  results  and  the  computational  burden  associated  with  large  n  suggests  that  the  bootstrap- 
one  is  a  viable  alternative.  Finally,  notice  that  the  bootstrap-one  method  is  identical  for  the 
standard  and  new  tests.  In  fact,  the  identity  can  be  shown  analytically  under  normality.  However, 
the  identity  is  not  true  in  general. 


Example  2.  In  this  example  we  consider  the  use  of  the  likelihood  ratio  test  to  test  for  outliers 
from  the  mixture  model  in  (1)  with  m  —  2  and  n  =  60.  Again  we  consider  the  case  in  which 
d  =  2  and  n)7  =  0,  and  specifically,  we  assume  that  the  component  densities  gi  and  are 
multivariate  normal  densities  associated  with  a 


N 


and 
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populations  respectively. 


Case  a:  pi=P2  =  .5. 

We  examine  the  power  of  the  test  for  detecting  outliers  from 

population  where  k  =  l, 9.  In  Figure  1(a)  we  show  data  from  a  mixture  of  two  populations 
with  Pi  =  P2  =  0.5  along  with  5  outliers.  In  Figure  1(b)  we  show  the  same  data  with  individual 
observations  labeled  with  regard  to  the  associated  component  population  or  outlier  population. 
The  outliers  are  indicated  by  solid  dots.  In  Figure  1(c)  we  again  show  the  labeled  data  along  with 
contours  of  the  mfacture  population.  Finally,  in  Figure  1(d)  we  show  means  and  contours  of  the 
two  component  populations  and  of  the  outlier  population.  In  Figure  2  we  show  the  contours  of 
the  mbcture  components  as  in  Figure  1(d)  along  with  the  outlier  means  (1+A:  -  5, 1  -  (fc  -  5))', 
k  =  1,  Also  in  this  figure  we  show  the  contour  of  the  outlier  population  for  the  case 

fc  =  2,  i.e.  the  mean  is  ( -  2,  4)'.  In  Table  2(a)  to  =  60  is  used  and  the  nominal  level  is  a  =  0.05. 
As  can  be  seen,  the  significance  level  is  close  to  the  nominal  level.  Whenever  the  outlier 
population  is  well  separated  from  the  component  distributions  of  the  mbrture  we  have  good  power 
while  as  would  be  expected  the  power  lowers  dramatically  for  fc  near  5.  The  true  powers  for 
fc  =  1, 2, 3,  and  4  are  the  same  as  those  for  fc  =  9, 8, 7  and  6  respectively,  due  to  symmetry.  The 
empirical  results  appear  to  verify  this  fact. 

Case  b:pi  =  0,25  and  P2  =  0.75. 

In  this  case  we  consider  the  same  scenario  as  Case  a  but  with  pi  =  0.25  and  p2  =  0.75. 

In  Figure  3  we  show  the  plots  corresponding  to  Figure  1  for  the  case  in  which  pi  =  0.25  and 
P2  =  0.75,  and  in  Table  2(b)  we  show  results  corresponding  to  those  in  Table  2(a)  for  this  case. 
Again,  we  see  the  significance  levels  are  accurate  and  that  powers  are  similar  to  those  in 
Table  2(a).  It  should  be  noted  that  due  to  smaller  pi  here,  there  was  a  very  small  fraction 
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( <  0.2%)  of  all  bootstrap  simulation  replications  that  did  not  converge  with  our  current 
program.  This  problem  seems  to  become  more  serious  when  smaller  values  of  n  are  used.  In  our 
analysis  we  simply  skip  any  bootstrap  realization  for  which  convergence  was  not  obtained  and 
generate  another  one.  Another  possible  approach  would  be  to  use  the  starting  values  as  final 
estimates  for  these  bootstrap  replications. 

5.  Concluding  Remarks 

In  this  report  we  have  proposed  a  simple  modified  likelihood  ratio  test  for  multivariate 
outlier  detections.  This  new  test  is  not  only  good  for  use  in  general  outlier  detection  problems, 
but  especially  applicable  when  the  traimng  sample  population  is  a  mixture  of  several  populations. 
In  the  new  test  no  assumption  is  necessaiy  for  the  covariance  structure  or  any  other  moments  of 
the  outlier  population,  and  in  fact  no  parametric  modeling  is  required  for  the  outlier  population. 
Furthermore,  although  with  weaker  assumptions  it  is  more  powerful  than  the  standard  likelihood 
ratio  test  in  the  simpler  non  mixture  situation  in  which  the  standard  test  applies. 

We  have  also  investigated  bootstrapping  the  distributions  of  the  test  statistics.  The 
computationally  intensive  resampling  method  seems  to  be  quite  effective.  When  the  training 
sample  size  is  large,  we  have  also  suggested  the  bootstrap-one  method,  which  significantly 
reduces  the  computing  time  and  seems  to  have  somewhat  more  power. 

It  should  be  noted  that  the  procedure  could  be  extended  to  cover  the  case  in  which  all  of 
the  training  sample  observations  are  unlabeled.  This,  however,  will  require  dealing  with  issues 
such  as  the  use  of  appropriate  starting  values  and  is  not  considered  here. 
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APPENDIX 


In  this  appendix,  we  show  that  W  «  f(Xn+i ;  0n)>  as  n  ->  cxd,  where  W  is  given  in  (5).  Let 


7  i{0)  =  ]n{L  my, 

£o{9)  =  \n{Lom  =7  1 W  +  ln{/(X„+i;  0)}.  (Al) 

Suppose ?„  and  $n+i  satisfy  the  conditions  that  L  i(0„)  =  sup  L  i(d)  and  Lo(^n+i)  = 

^  ^  ^  0e9 

sup  XoW.  respectively.  Then  £o(^n+i)  =  0  and  £  ((^n)  =  0.  Thus,  from 
0ee 


£'o(0n+i)  =  Wn)  +  £o(0n)(On+i  “  ^n)  +  Smaller  terms 

=  |[ln{/(X„+i;  0)}]1  (?„)(?„+!  -?n)  +  smaUer  terms, 


we  have 

e„+l-«„  =  Op(i).  (A2) 

since  «o(5n+i)  =  0.  C(^»)  is  of  order  0,{n),  and  |K/(.Sr„+i;  «)}]|^^isOp(l).  Now  by 
(Al)  and  (A2), 

W^=e>cp{fo(«n+l)-f  l(«n)} 

=  exp{f„(?„)  +  fi{?„)(?„+i  -  9„)  +  Op(i)  -7  ,(?„)} 

=  e>tp[ln{/(.S^n+i;«n))l+Op(s) 

=  /(X„«;?„)+Op(l), 

completing  the  proof. 
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Table  1.  Comparisons  of  significant  level  and  power  of  the  standard  likelihood  ratio  test 
and  modified  likelihood  ratio  test,  using  two  (Full  and  One)  bootstrap  approximations. 


n 

Significance  Level 

Power 

Standard 

Modified 

Standard 

Modified 

Full 

One 

Full 

One 

Full 

One 

Full 

One 

15 

.118 

.065 

.118 

.522 

.729 

.568 

.729 

20 

.048 

.100 

.063 

.100 

.541 

.709 

.588 

.709 

25 

.036 

.081 

.048 

.081 

.563 

.704 

.601 

.704 

30 

.047 

.084 

.051 

.084 

.579 

.718 

.609 

.718 

50 

.046 

.064 

.050 

.064 

.626 

.696 

.645 

.696 

100 

.056 

.059 

.057 

.059 

.646 

.677 

.657 

.677 

150 

.059 

.057 

.061 

.057 

.655 

.703 

.665 

.703 

s.e. 

.007 

.015 

Table  2a.  Significance  level  and  power  of  new  test  in  Example  2; 

=  P2  =  0.5,  n  =  60,  B  =  199,  1000  replications 


Level 

.050  (s.e.  .007) 

k 

1 

2 

3 

4 

5 

6 

7 

8 

9 

Power 

1.000 

.984 

.754 

.031 

.231 

.767 

.980 

1.000 

S.e. 

.001 

.004 

.014 

.013 

.006 

.013 

.014 

.004 

.001 

Table  2b.  Significance  level  and  power  of  new  test  in  Example  2; 
Pi  =  0.25,  p2  =  0.75,  n  =  60,  B  =  199,  1000 replications 


Level 

.055  (s.e.  .007) 

k 

1 

2 

3 

4 

5 

6 

7 

8 

9 

Power 

.999 

.970 

.709 

.245 

.042 

.242 

.701 

.972 

.999 

S.e. 

.001 

.005 

.014 

.014 

.006 

.014 

.014 

.005 

.001 
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Figure  2  -  Mixture  distributions  for  Example  2a  showing  means  of  outlier  distributions  in  the  simulations 
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